chore: Regenerate all playbooks

This commit is contained in:
GitLab CI 2025-10-08 22:00:07 +00:00
parent 3aea2df880
commit b3a97461df
24 changed files with 770 additions and 137 deletions


@@ -58,14 +58,15 @@ All required assets can be found [in the ComfyUI repository on GitHub](https://g
## Time & risk
**Estimated time:** 30-45 minutes (including model download)
**Risk level:** Medium
- Model downloads are large (~2GB) and may fail due to network issues
- Port 8188 must be accessible for web interface functionality
**Rollback:** Virtual environment can be deleted to remove all installed packages. Downloaded models
can be removed manually from the checkpoints directory.
* **Estimated time:** 30-45 minutes (including model download)
* **Risk level:** Medium
  * Model downloads are large (~2GB) and may fail due to network issues
  * Port 8188 must be accessible for web interface functionality
* **Rollback:** Virtual environment can be deleted to remove all installed packages. Downloaded models can be removed manually from the checkpoints directory.
* DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. With many applications still updating to take advantage of UMA, you may encounter memory issues even when within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```
## Instructions


@@ -36,11 +36,13 @@ You will learn how to access and use the DGX Dashboard on your DGX Spark device.
## Time & risk
**Duration:** 15-30 minutes for complete walkthrough including sample AI workload
**Risk level:** Low - Web interface operations with minimal system impact
**Rollback:** Stop JupyterLab instances through dashboard interface; no permanent system changes made during normal usage.
* **Duration:** 15-30 minutes for complete walkthrough including sample AI workload
* **Risk level:** Low - Web interface operations with minimal system impact
* **Rollback:** Stop JupyterLab instances through dashboard interface; no permanent system changes made during normal usage.
* DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. With many applications still updating to take advantage of UMA, you may encounter memory issues even when within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```
## Instructions


@@ -39,15 +39,17 @@ The setup includes:
## Time & risk
**Duration**:
- 30-45 minutes for initial setup plus model download time
- 1-2 hours for dreambooth LoRA training
**Risks**:
- Docker permission issues may require user group changes and session restart
- The recipe requires hyperparameter tuning and a high-quality dataset for the best results
* **Duration**:
  * 30-45 minutes for initial setup plus model download time
  * 1-2 hours for dreambooth LoRA training
* **Risks**:
  * Docker permission issues may require user group changes and session restart
  * The recipe requires hyperparameter tuning and a high-quality dataset for the best results
**Rollback**: Stop and remove Docker containers, delete downloaded models if needed.
* DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. With many applications still updating to take advantage of UMA, you may encounter memory issues even when within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```
## Instructions


@@ -59,13 +59,15 @@ All required assets can be found [here on GitHub](https://gitlab.com/nvidia/dgx-
## Time & risk
**Duration:** 2-3 hours including setup, tutorial completion, and validation
**Risks:**
- Package dependency conflicts in Python environment
- Performance validation may require architecture-specific optimizations
* **Duration:** 2-3 hours including setup, tutorial completion, and validation
* **Risks:**
  * Package dependency conflicts in Python environment
  * Performance validation may require architecture-specific optimizations
**Rollback:** Container environments provide isolation; remove containers and restart to reset state.
* DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. With many applications still updating to take advantage of UMA, you may encounter memory issues even when within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```
## Instructions


@@ -63,14 +63,13 @@ model adaptation for specialized domains while leveraging hardware-specific opti
## Time & risk
**Duration:** 30-60 minutes for initial setup, 1-7 hours for training depending on model size
and dataset.
**Risks:** Model downloads require significant bandwidth and storage. Training may consume
substantial GPU memory and require parameter tuning for hardware constraints.
**Rollback:** Remove Docker containers and cloned repositories. Training checkpoints are
saved locally and can be deleted to reclaim storage space.
* **Duration:** 30-60 minutes for initial setup, 1-7 hours for training depending on model size and dataset.
* **Risks:** Model downloads require significant bandwidth and storage. Training may consume substantial GPU memory and require parameter tuning for hardware constraints.
* **Rollback:** Remove Docker containers and cloned repositories. Training checkpoints are saved locally and can be deleted to reclaim storage space.
* DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. With many applications still updating to take advantage of UMA, you may encounter memory issues even when within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```
## Instructions


@@ -42,13 +42,15 @@ The setup includes:
## Time & risk
**Estimated time**: 30 minutes to an hour
**Risks**:
- Docker permission issues may require user group changes and session restart
- Setup includes downloading model files for gpt-oss-120B (~63GB), Deepseek-Coder:6.7B-Instruct (~7GB), and Qwen3-Embedding-4B (~4GB), which may take between 30 minutes and 2 hours depending on network speed
**Rollback**: Stop and remove Docker containers using provided cleanup commands.
* **Estimated time**: 30 minutes to an hour
* **Risks**:
  * Docker permission issues may require user group changes and session restart
  * Setup includes downloading model files for gpt-oss-120B (~63GB), Deepseek-Coder:6.7B-Instruct (~7GB), and Qwen3-Embedding-4B (~4GB), which may take between 30 minutes and 2 hours depending on network speed
* **Rollback**: Stop and remove Docker containers using provided cleanup commands.
* DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. With many applications still updating to take advantage of UMA, you may encounter memory issues even when within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```
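The 30-minutes-to-2-hours download window quoted above can be sanity-checked from the stated model sizes. A minimal sketch — the sizes come from the playbook text, but the link speeds below are assumptions, not measurements:

```python
# Rough download-time estimate for the model pulls listed above.
def download_minutes(size_gb, mbit_per_s):
    # size_gb * 8000 converts gigabytes to megabits; divide by the link
    # rate to get seconds, then convert to minutes
    return size_gb * 8000 / mbit_per_s / 60

total_gb = 63 + 7 + 4  # gpt-oss-120B + Deepseek-Coder 6.7B + Qwen3-Embedding-4B
for speed in (100, 500, 1000):  # Mbit/s, assumed link speeds
    print(f"{speed} Mbit/s: ~{download_minutes(total_gb, speed):.0f} min")
```

At an assumed 100 Mbit/s, the combined ~74 GB takes roughly 99 minutes, which lands near the upper end of the quoted range.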
## Instructions


@@ -43,11 +43,13 @@ All necessary files for the playbook can be found [here on GitHub](https://githu
## Time & risk
**Duration:** 45-90 minutes for complete setup and initial model fine-tuning
**Risks:** Model downloads can be large (several GB); ARM64 package compatibility issues may require troubleshooting; distributed training setup complexity increases with multi-node configurations
**Rollback:** Virtual environments can be completely removed; no system-level changes are made to the host system beyond package installations.
* **Duration:** 45-90 minutes for complete setup and initial model fine-tuning
* **Risks:** Model downloads can be large (several GB); ARM64 package compatibility issues may require troubleshooting; distributed training setup complexity increases with multi-node configurations
* **Rollback:** Virtual environments can be completely removed; no system-level changes are made to the host system beyond package installations.
* DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. With many applications still updating to take advantage of UMA, you may encounter memory issues even when within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```
## Instructions


@@ -60,14 +60,16 @@ completions.
### Time & risk
**Estimated time:** 15-30 minutes for setup and validation
**Risks:**
- Large model downloads may take significant time depending on network speed
- GPU memory requirements vary by model size
- Container startup time depends on model loading
**Rollback:** Stop and remove containers with `docker stop <CONTAINER_NAME> && docker rm <CONTAINER_NAME>`. Remove cached models from `~/.cache/nim` if disk space recovery is needed.
* **Estimated time:** 15-30 minutes for setup and validation
* **Risks:**
  * Large model downloads may take significant time depending on network speed
  * GPU memory requirements vary by model size
  * Container startup time depends on model loading
* **Rollback:** Stop and remove containers with `docker stop <CONTAINER_NAME> && docker rm <CONTAINER_NAME>`. Remove cached models from `~/.cache/nim` if disk space recovery is needed.
* DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. With many applications still updating to take advantage of UMA, you may encounter memory issues even when within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```
## Instructions


@@ -58,14 +58,16 @@ df -h .
## Time & risk
**Estimated duration**: 45-90 minutes depending on network speed and model size
**Risks**:
- Model download may fail due to network issues or Hugging Face authentication problems
- Quantization process is memory-intensive and may fail on systems with insufficient GPU memory
- Output files are large (several GB) and require adequate storage space
**Rollback**: Remove the output directory and any pulled Docker images to restore original state.
* **Estimated duration**: 45-90 minutes depending on network speed and model size
* **Risks**:
  * Model download may fail due to network issues or Hugging Face authentication problems
  * Quantization process is memory-intensive and may fail on systems with insufficient GPU memory
  * Output files are large (several GB) and require adequate storage space
* **Rollback**: Remove the output directory and any pulled Docker images to restore original state.
* DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. With many applications still updating to take advantage of UMA, you may encounter memory issues even when within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```
## Instructions


@@ -38,13 +38,15 @@ for model management, persistent data storage, and GPU acceleration for model in
## Time & risk
**Duration**: 15-20 minutes for initial setup, plus model download time (varies by model size)
**Risks**:
- Docker permission issues may require user group changes and session restart
- Large model downloads may take significant time depending on network speed
**Rollback**: Stop and remove Docker containers using provided cleanup commands, remove custom port from NVIDIA Sync settings.
* **Duration**: 15-20 minutes for initial setup, plus model download time (varies by model size)
* **Risks**:
  * Docker permission issues may require user group changes and session restart
  * Large model downloads may take significant time depending on network speed
* **Rollback**: Stop and remove Docker containers using provided cleanup commands, remove custom port from NVIDIA Sync settings.
* DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. With many applications still updating to take advantage of UMA, you may encounter memory issues even when within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```
## Instructions


@@ -37,9 +37,12 @@ All files required for fine-tuning are included in the folder in [the GitHub rep
## Time & risk
**Time estimate:** 30-45 mins for setup and running fine-tuning. Fine-tuning run time varies depending on model size
**Risks:** Model downloads can be large (several GB), ARM64 package compatibility issues may require troubleshooting.
* **Time estimate:** 30-45 mins for setup and running fine-tuning. Fine-tuning run time varies depending on model size
* **Risks:** Model downloads can be large (several GB), ARM64 package compatibility issues may require troubleshooting.
* DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. With many applications still updating to take advantage of UMA, you may encounter memory issues even when within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```
## Instructions


@@ -0,0 +1,195 @@
#
# SPDX-FileCopyrightText: Copyright (c) 1993-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import torch
import argparse

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer

# Define prompt templates
ALPACA_PROMPT_TEMPLATE = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction: {}
### Input: {}
### Response: {}"""


def get_alpaca_dataset(eos_token, dataset_size=500):
    # Preprocess the dataset
    def preprocess(x):
        texts = [
            ALPACA_PROMPT_TEMPLATE.format(instruction, input, output) + eos_token
            for instruction, input, output in zip(x["instruction"], x["input"], x["output"])
        ]
        return {"text": texts}

    dataset = load_dataset("tatsu-lab/alpaca", split="train").select(range(dataset_size)).shuffle(seed=42)
    return dataset.map(preprocess, remove_columns=dataset.column_names, batched=True)


def main(args):
    # Load the model and tokenizer
    print(f"Loading model: {args.model_name}")
    model = AutoModelForCausalLM.from_pretrained(
        args.model_name,
        dtype=args.dtype,
        device_map="auto",
        trust_remote_code=True
    )
    tokenizer = AutoTokenizer.from_pretrained(args.model_name, trust_remote_code=True)
    tokenizer.pad_token = tokenizer.eos_token

    # Print model information
    total_params = sum(p.numel() for p in model.parameters())
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Total parameters: {total_params:,}")
    print(f"Trainable parameters: {trainable_params:,} (100% - Full Fine-tuning)")

    # Load and preprocess the dataset
    print(f"Loading dataset with {args.dataset_size} samples...")
    dataset = get_alpaca_dataset(tokenizer.eos_token, args.dataset_size)

    # Configure the SFT config
    config = {
        "per_device_train_batch_size": args.batch_size,
        "num_train_epochs": 0.01,  # Warmup epoch
        "gradient_accumulation_steps": args.gradient_accumulation_steps,
        "learning_rate": args.learning_rate,
        "optim": "adamw_torch",
        "save_strategy": 'no',
        "remove_unused_columns": False,
        "seed": 42,
        "dataset_text_field": "text",
        "packing": False,
        "max_seq_length": args.seq_length,
        "torch_compile": False,
        "report_to": "none",
        "logging_dir": args.log_dir,
        "logging_steps": args.logging_steps,
        "gradient_checkpointing": args.gradient_checkpointing,  # Save memory
    }

    # Compile model if requested
    if args.use_torch_compile:
        print("Compiling model with torch.compile()...")
        model = torch.compile(model)

        # Warmup for torch compile
        print("Running warmup for torch.compile()...")
        SFTTrainer(
            model=model,
            processing_class=tokenizer,
            train_dataset=dataset,
            args=SFTConfig(**config),
        ).train()

    # Train the model
    print(f"\nStarting full fine-tuning for {args.num_epochs} epoch(s)...")
    config["num_train_epochs"] = args.num_epochs
    config["report_to"] = "tensorboard"
    trainer = SFTTrainer(
        model=model,
        processing_class=tokenizer,
        train_dataset=dataset,
        args=SFTConfig(**config),
    )
    trainer_stats = trainer.train()

    # Print training statistics
    print(f"\n{'='*60}")
    print("TRAINING COMPLETED")
    print(f"{'='*60}")
    print(f"Training runtime: {trainer_stats.metrics['train_runtime']:.2f} seconds")
    print(f"Samples per second: {trainer_stats.metrics['train_samples_per_second']:.2f}")
    print(f"Steps per second: {trainer_stats.metrics['train_steps_per_second']:.2f}")
    print(f"Train loss: {trainer_stats.metrics['train_loss']:.4f}")
    print(f"{'='*60}\n")

    # Save model if requested
    if args.output_dir:
        print(f"Saving model to {args.output_dir}...")
        trainer.save_model(args.output_dir)
        tokenizer.save_pretrained(args.output_dir)
        print("Model saved successfully!")


def parse_arguments():
    parser = argparse.ArgumentParser(description="Llama 3.2 3B Full Fine-tuning (SFT)")

    # Model configuration
    parser.add_argument("--model_name", type=str, default="meta-llama/Llama-3.2-3B-Instruct",
                        help="Model name or path")
    parser.add_argument("--dtype", type=str, default="bfloat16",
                        choices=["float32", "float16", "bfloat16"],
                        help="Model dtype")

    # Training configuration
    parser.add_argument("--batch_size", type=int, default=8,
                        help="Per device training batch size")
    parser.add_argument("--seq_length", type=int, default=2048,
                        help="Maximum sequence length")
    parser.add_argument("--num_epochs", type=int, default=1,
                        help="Number of training epochs")
    parser.add_argument("--gradient_accumulation_steps", type=int, default=1,
                        help="Gradient accumulation steps")
    parser.add_argument("--learning_rate", type=float, default=5e-5,
                        help="Learning rate")
    parser.add_argument("--gradient_checkpointing", action="store_true",
                        help="Enable gradient checkpointing to save memory")

    # Dataset configuration
    parser.add_argument("--dataset_size", type=int, default=500,
                        help="Number of samples to use from dataset")

    # Logging configuration
    parser.add_argument("--logging_steps", type=int, default=1,
                        help="Log every N steps")
    parser.add_argument("--log_dir", type=str, default="logs",
                        help="Directory for logs")

    # Compilation and saving
    parser.add_argument("--use_torch_compile", action="store_true",
                        help="Use torch.compile() for faster training")
    parser.add_argument("--output_dir", type=str, default=None,
                        help="Directory to save the fine-tuned model")

    return parser.parse_args()


if __name__ == "__main__":
    args = parse_arguments()

    print(f"\n{'='*60}")
    print("LLAMA 3.2 3B FULL FINE-TUNING CONFIGURATION")
    print(f"{'='*60}")
    print(f"Model: {args.model_name}")
    print("Training mode: Full SFT")
    print(f"Batch size: {args.batch_size}")
    print(f"Gradient accumulation: {args.gradient_accumulation_steps}")
    print(f"Effective batch size: {args.batch_size * args.gradient_accumulation_steps}")
    print(f"Sequence length: {args.seq_length}")
    print(f"Number of epochs: {args.num_epochs}")
    print(f"Learning rate: {args.learning_rate}")
    print(f"Dataset size: {args.dataset_size}")
    print(f"Gradient checkpointing: {args.gradient_checkpointing}")
    print(f"Torch compile: {args.use_torch_compile}")
    print(f"{'='*60}\n")

    main(args)
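To see what a single training record looks like after the Alpaca preprocessing above, the template logic can be exercised in isolation. A minimal sketch — the instruction/input/output values here are made up, and `</s>` stands in for whatever EOS token the tokenizer provides:

```python
# Mirrors the record formatting performed by get_alpaca_dataset above.
ALPACA_PROMPT_TEMPLATE = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction: {}
### Input: {}
### Response: {}"""

def format_example(instruction, inp, output, eos_token="</s>"):
    # One dataset row becomes one flat "text" string terminated by EOS
    return ALPACA_PROMPT_TEMPLATE.format(instruction, inp, output) + eos_token

text = format_example("Name a primary color.", "", "Red.")
print(text.endswith("Red.</s>"))  # True
```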


@@ -0,0 +1,228 @@
#
# SPDX-FileCopyrightText: Copyright (c) 1993-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import torch
import argparse
import os

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import get_peft_model, LoraConfig, TaskType, prepare_model_for_kbit_training

# Define prompt templates
ALPACA_PROMPT_TEMPLATE = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction: {}
### Input: {}
### Response: {}"""


def get_alpaca_dataset(eos_token, dataset_size=500):
    # Preprocess the dataset
    def preprocess(x):
        texts = [
            ALPACA_PROMPT_TEMPLATE.format(instruction, input, output) + eos_token
            for instruction, input, output in zip(x["instruction"], x["input"], x["output"])
        ]
        return {"text": texts}

    dataset = load_dataset("tatsu-lab/alpaca", split="train").select(range(dataset_size)).shuffle(seed=42)
    return dataset.map(preprocess, remove_columns=dataset.column_names, batched=True)


def main(args):
    # Load the model and tokenizer
    print(f"Loading model: {args.model_name}")
    print("Training mode: QLoRA (4-bit quantization)")

    # Use balanced device map for QLoRA to avoid device placement issues
    # "balanced" distributes model across available GPUs more reliably than "auto"
    device_map_config = "balanced" if torch.cuda.device_count() > 1 else {"": 0}

    # Configure 4-bit quantization for QLoRA
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type='nf4',
        bnb_4bit_compute_dtype=getattr(torch, args.dtype),
    )
    model = AutoModelForCausalLM.from_pretrained(
        args.model_name,
        quantization_config=quantization_config,
        dtype=args.dtype,
        device_map=device_map_config,
        trust_remote_code=True
    )
    tokenizer = AutoTokenizer.from_pretrained(args.model_name, trust_remote_code=True)
    tokenizer.pad_token = tokenizer.eos_token

    # Prepare model for QLoRA training
    print(f"Preparing model for QLoRA (4-bit) with rank {args.lora_rank}...")
    model = prepare_model_for_kbit_training(model)
    model = get_peft_model(model, LoraConfig(
        r=args.lora_rank,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
        lora_alpha=16,
        lora_dropout=0,
        task_type=TaskType.CAUSAL_LM
    ))
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total_params = sum(p.numel() for p in model.parameters())
    print(f"Trainable parameters: {trainable_params:,} ({100 * trainable_params / total_params:.2f}%)")

    # Load and preprocess the dataset
    print(f"Loading dataset with {args.dataset_size} samples...")
    dataset = get_alpaca_dataset(tokenizer.eos_token, args.dataset_size)

    # Configure the SFT config
    config = {
        "per_device_train_batch_size": args.batch_size,
        "num_train_epochs": 0.01,  # Warmup epoch
        "gradient_accumulation_steps": args.gradient_accumulation_steps,
        "learning_rate": args.learning_rate,
        "optim": "adamw_torch",
        "save_strategy": 'no',
        "remove_unused_columns": False,
        "seed": 42,
        "dataset_text_field": "text",
        "packing": False,
        "max_seq_length": args.seq_length,
        "torch_compile": False,
        "report_to": "none",
        "logging_dir": args.log_dir,
        "logging_steps": args.logging_steps,
        "gradient_checkpointing": args.gradient_checkpointing
    }

    # Compile model if requested
    if args.use_torch_compile:
        print("Compiling model with torch.compile()...")
        model = torch.compile(model)

        # Warmup for torch compile
        print("Running warmup for torch.compile()...")
        SFTTrainer(
            model=model,
            processing_class=tokenizer,
            train_dataset=dataset,
            args=SFTConfig(**config),
        ).train()

    # Train the model
    print(f"\nStarting QLoRA fine-tuning for {args.num_epochs} epoch(s)...")
    config["num_train_epochs"] = args.num_epochs
    config["report_to"] = "tensorboard"
    trainer = SFTTrainer(
        model=model,
        processing_class=tokenizer,
        train_dataset=dataset,
        args=SFTConfig(**config),
    )
    trainer_stats = trainer.train()

    # Print training statistics
    print(f"\n{'='*60}")
    print("TRAINING COMPLETED")
    print(f"{'='*60}")
    print(f"Training runtime: {trainer_stats.metrics['train_runtime']:.2f} seconds")
    print(f"Samples per second: {trainer_stats.metrics['train_samples_per_second']:.2f}")
    print(f"Steps per second: {trainer_stats.metrics['train_steps_per_second']:.2f}")
    print(f"Train loss: {trainer_stats.metrics['train_loss']:.4f}")
    print(f"{'='*60}\n")

    # Save model if requested
    if args.output_dir:
        print(f"Saving model to {args.output_dir}...")
        trainer.save_model(args.output_dir)
        tokenizer.save_pretrained(args.output_dir)
        print("Model saved successfully!")


def parse_arguments():
    parser = argparse.ArgumentParser(description="Llama 3.1 70B Fine-tuning with QLoRA")

    # Model configuration
    parser.add_argument("--model_name", type=str, default="meta-llama/Llama-3.1-70B-Instruct",
                        help="Model name or path")
    parser.add_argument("--dtype", type=str, default="bfloat16",
                        help="Model dtype (e.g., float32, float16, bfloat16)")

    # Training configuration
    parser.add_argument("--batch_size", type=int, default=8,
                        choices=[1, 2, 4, 8, 16, 32],
                        help="Per device training batch size")
    parser.add_argument("--seq_length", type=int, default=2048,
                        choices=[256, 512, 1024, 2048, 4096, 8192],
                        help="Maximum sequence length")
    parser.add_argument("--num_epochs", type=int, default=1,
                        help="Number of training epochs")
    parser.add_argument("--gradient_accumulation_steps", type=int, default=1,
                        help="Gradient accumulation steps")
    parser.add_argument("--learning_rate", type=float, default=1e-4,
                        help="Learning rate")
    parser.add_argument("--gradient_checkpointing", action="store_true",
                        help="Enable gradient checkpointing to save memory")

    # LoRA configuration
    parser.add_argument("--lora_rank", type=int, default=8,
                        help="LoRA rank")

    # Dataset configuration
    parser.add_argument("--dataset_size", type=int, default=500,
                        help="Number of samples to use from dataset")

    # Logging configuration
    parser.add_argument("--logging_steps", type=int, default=1,
                        help="Log every N steps")
    parser.add_argument("--log_dir", type=str, default="logs",
                        help="Directory for logs")

    # Compilation and saving
    parser.add_argument("--use_torch_compile", action="store_true",
                        help="Use torch.compile() for faster training")
    parser.add_argument("--output_dir", type=str, default=None,
                        help="Directory to save the fine-tuned model")

    return parser.parse_args()


if __name__ == "__main__":
    args = parse_arguments()

    print(f"\n{'='*60}")
    print("LLAMA 3.1 70B QLoRA FINE-TUNING")
    print(f"{'='*60}")
    print(f"Model: {args.model_name}")
    print("Training mode: QLoRA (4-bit quantization)")
    print(f"Batch size: {args.batch_size}")
    print(f"Gradient accumulation: {args.gradient_accumulation_steps}")
    print(f"Effective batch size: {args.batch_size * args.gradient_accumulation_steps}")
    print(f"Sequence length: {args.seq_length}")
    print(f"Number of epochs: {args.num_epochs}")
    print(f"Learning rate: {args.learning_rate}")
    print(f"LoRA rank: {args.lora_rank}")
    print(f"Dataset size: {args.dataset_size}")
    print(f"Gradient checkpointing: {args.gradient_checkpointing}")
    print(f"Torch compile: {args.use_torch_compile}")
    print(f"{'='*60}\n")

    main(args)
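For intuition about why the 4-bit quantization above is what lets a 70B model fit in memory at all, a back-of-the-envelope weights-only estimate helps. The figures are rough: real usage adds activations, LoRA adapters, optimizer state, and the constants that NF4 double-quantization itself stores.

```python
# Weights-only memory estimate for a model at a given per-parameter precision.
def weight_gb(n_params, bits_per_param):
    # bits -> bytes (/8), bytes -> GB (/1e9)
    return n_params * bits_per_param / 8 / 1e9

params = 70e9  # Llama 3.1 70B, approximate parameter count
print(f"bf16 weights: {weight_gb(params, 16):.0f} GB")  # 140 GB
print(f"nf4 weights:  {weight_gb(params, 4):.0f} GB")   # 35 GB
```

Quantizing the frozen base weights to NF4 cuts the resident weight footprint by roughly 4x, while the small bf16 LoRA adapters remain the only trainable state.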


@@ -0,0 +1,176 @@
#
# SPDX-FileCopyrightText: Copyright (c) 1993-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import torch
import argparse
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import get_peft_model, LoraConfig, TaskType
# Define prompt templates
ALPACA_PROMPT_TEMPLATE = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction: {}
### Input: {}
### Response: {}"""
def get_alpaca_dataset(eos_token, dataset_size=500):
# Preprocess the dataset
def preprocess(x):
texts = [
ALPACA_PROMPT_TEMPLATE.format(instruction, input, output) + eos_token
for instruction, input, output in zip(x["instruction"], x["input"], x["output"])
]
return {"text": texts}
dataset = load_dataset("tatsu-lab/alpaca", split="train").select(range(dataset_size)).shuffle(seed=42)
return dataset.map(preprocess, remove_columns=dataset.column_names, batched=True)
def main(args):
# Load the model and tokenizer
print(f"Loading model: {args.model_name}")
model = AutoModelForCausalLM.from_pretrained(
args.model_name,
dtype=args.dtype,
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(args.model_name)
tokenizer.pad_token = tokenizer.eos_token
# Configure LoRA config
model = get_peft_model(model, LoraConfig(
r=args.lora_rank,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
lora_alpha=16,
lora_dropout=0,
task_type=TaskType.CAUSAL_LM))
print(f"Trainable parameters = {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")
# Load and preprocess the dataset
print(f"Loading dataset with {args.dataset_size} samples...")
dataset = get_alpaca_dataset(tokenizer.eos_token, args.dataset_size)
# Configure the SFT config
config = {
"per_device_train_batch_size": args.batch_size,
"num_train_epochs": 0.01,
"gradient_accumulation_steps": args.gradient_accumulation_steps,
"learning_rate": args.learning_rate,
"optim": "adamw_torch",
"save_strategy": 'no',
"remove_unused_columns": False,
"seed": 42,
"dataset_text_field": "text",
"packing": False,
"max_seq_length": args.seq_length,
"torch_compile": False,
"report_to": "none",
"logging_dir": args.log_dir,
"logging_steps": args.logging_steps
}
# Warmup for torch compile
model = torch.compile(model)
SFTTrainer(
model=model,
processing_class=tokenizer,
train_dataset=dataset,
args=SFTConfig(**config),
).train()
# Train the model
print(f"\nStarting LoRA fine-tuning for {args.num_epochs} epoch(s)...")
config["num_train_epochs"] = args.num_epochs
config["report_to"] = "tensorboard"
trainer = SFTTrainer(
model=model,
processing_class=tokenizer,
train_dataset=dataset,
args=SFTConfig(**config),
)
trainer_stats = trainer.train()
# Print training statistics
print(f"\n{'='*60}")
print("TRAINING COMPLETED")
print(f"{'='*60}")
print(f"Training runtime: {trainer_stats.metrics['train_runtime']:.2f} seconds")
print(f"Samples per second: {trainer_stats.metrics['train_samples_per_second']:.2f}")
print(f"Steps per second: {trainer_stats.metrics['train_steps_per_second']:.2f}")
print(f"Train loss: {trainer_stats.metrics['train_loss']:.4f}")
print(f"{'='*60}\n")
def parse_arguments():
parser = argparse.ArgumentParser(description="Llama 3.1 8B Fine-tuning with LoRA")
# Model configuration
parser.add_argument("--model_name", type=str, default="meta-llama/Llama-3.1-8B-Instruct",
help="Model name or path")
parser.add_argument("--dtype", type=str, default="bfloat16",
choices=["float32", "float16", "bfloat16"],
help="Model dtype")
# Training configuration
parser.add_argument("--batch_size", type=int, default=4,
help="Per device training batch size")
parser.add_argument("--seq_length", type=int, default=2048,
help="Maximum sequence length")
parser.add_argument("--num_epochs", type=int, default=1,
help="Number of training epochs")
parser.add_argument("--gradient_accumulation_steps", type=int, default=1,
help="Gradient accumulation steps")
parser.add_argument("--learning_rate", type=float, default=1e-4,
help="Learning rate")
# LoRA configuration
parser.add_argument("--lora_rank", type=int, default=8,
help="LoRA rank")
# Dataset configuration
parser.add_argument("--dataset_size", type=int, default=500,
help="Number of samples to use from dataset")
# Logging configuration
parser.add_argument("--logging_steps", type=int, default=1,
help="Log every N steps")
parser.add_argument("--log_dir", type=str, default="logs",
help="Directory for logs")
return parser.parse_args()
if __name__ == "__main__":
args = parse_arguments()
print(f"\n{'='*60}")
print("LLAMA 3.1 8B LoRA FINE-TUNING CONFIGURATION")
print(f"{'='*60}")
print(f"Model: {args.model_name}")
print(f"Batch size: {args.batch_size}")
print(f"Sequence length: {args.seq_length}")
print(f"Number of epochs: {args.num_epochs}")
print(f"Learning rate: {args.learning_rate}")
print(f"LoRA rank: {args.lora_rank}")
print(f"Dataset size: {args.dataset_size}")
print(f"{'='*60}\n")
main(args)
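The trainable-parameter figure the script prints can be sanity-checked offline: a rank-r LoRA adapter on a weight matrix of shape (out, in) adds r * (in + out) parameters. The sketch below is standalone and assumes Llama-3.1-8B's published dimensions (hidden size 4096, KV projection dim 1024, MLP intermediate size 14336, 32 layers) rather than reading anything from the script; the script's own printout may differ slightly depending on which modules PEFT counts.

```python
# Estimate LoRA trainable parameters for the seven target modules used above,
# assuming Llama-3.1-8B-style shapes. A rank-r adapter on a weight of shape
# (out_f, in_f) adds r * (in_f + out_f) parameters (the A and B matrices).
def lora_params(rank, hidden=4096, kv=1024, inter=14336, layers=32):
    shapes = {
        "q_proj": (hidden, hidden),
        "k_proj": (kv, hidden),      # 8 KV heads -> 1024-dim KV projections
        "v_proj": (kv, hidden),
        "o_proj": (hidden, hidden),
        "gate_proj": (inter, hidden),
        "up_proj": (inter, hidden),
        "down_proj": (hidden, inter),
    }
    per_layer = sum(rank * (out_f + in_f) for out_f, in_f in shapes.values())
    return per_layer * layers

print(f"{lora_params(8):,}")  # → 20,971,520 for the default rank of 8
```

Since the count scales linearly in the rank, doubling `--lora_rank` doubles the adapter size.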

View File

View File

@ -50,12 +50,13 @@ architectures.
## Time & risk
**Estimated time:** 30-45 minutes (including AI Workbench installation if needed)
**Risk level:** Low - Uses pre-built containers and established APIs
**Rollback:** Simply delete the cloned project from AI Workbench to remove all components. No system
changes are made outside the AI Workbench environment.
* **Estimated time:** 30-45 minutes (including AI Workbench installation if needed)
* **Risk level:** Low - Uses pre-built containers and established APIs
* **Rollback:** Simply delete the cloned project from AI Workbench to remove all components. No system changes are made outside the AI Workbench environment.
* DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. With many applications still updating to take advantage of UMA, you may encounter memory issues even when within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```
## Instructions

View File

@ -52,11 +52,13 @@ These examples demonstrate how to accelerate large language model inference whil
## Time & risk
**Duration:** 10-20 minutes for setup, additional time for model downloads (varies by network speed)
**Risks:** GPU memory exhaustion with large models, container registry access issues, network timeouts during downloads
**Rollback:** Stop Docker containers and optionally clean up downloaded model cache.
* **Duration:** 10-20 minutes for setup, additional time for model downloads (varies by network speed)
* **Risks:** GPU memory exhaustion with large models, container registry access issues, network timeouts during downloads
* **Rollback:** Stop Docker containers and optionally clean up downloaded model cache.
* DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. With many applications still updating to take advantage of UMA, you may encounter memory issues even when within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```
## Instructions

View File

@ -62,15 +62,16 @@ all traffic automatically encrypted and NAT traversal handled transparently.
## Time & risk
**Duration**: 15-30 minutes for initial setup, 5 minutes per additional device
**Risks**:
- Potential SSH service configuration conflicts
- Network connectivity issues during initial setup
- Authentication provider service dependencies
**Rollback**: Tailscale can be completely removed with `sudo apt remove tailscale`
and all network routing automatically reverts to default settings.
* **Duration**: 15-30 minutes for initial setup, 5 minutes per additional device
* **Risks**:
* Potential SSH service configuration conflicts
* Network connectivity issues during initial setup
* Authentication provider service dependencies
* **Rollback**: Tailscale can be completely removed with `sudo apt remove tailscale` and all network routing automatically reverts to default settings.
* DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. With many applications still updating to take advantage of UMA, you may encounter memory issues even when within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```
## Instructions

View File

@ -110,11 +110,13 @@ Reminder: not all model architectures are supported for NVFP4 quantization.
## Time & risk
**Duration**: 45-60 minutes for setup and API server deployment
**Risk level**: Medium - container pulls and model downloads may fail due to network issues
**Rollback**: Stop inference servers and remove downloaded models to free resources.
* **Duration**: 45-60 minutes for setup and API server deployment
* **Risk level**: Medium - container pulls and model downloads may fail due to network issues
* **Rollback**: Stop inference servers and remove downloaded models to free resources.
* DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. With many applications still updating to take advantage of UMA, you may encounter memory issues even when within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```
## Single Spark

View File

@ -48,15 +48,16 @@ The Python test script can be found [here on GitHub](https://gitlab.com/nvidia/d
## Time & risk
**Duration**: 30-60 minutes for initial setup and test run
**Risks**:
- Triton compiler version mismatches may cause compilation errors
- CUDA toolkit configuration issues may prevent kernel compilation
- Memory constraints on smaller models require batch size adjustments
**Rollback**: Uninstall packages with `pip uninstall unsloth torch torchvision`.
* **Duration**: 30-60 minutes for initial setup and test run
* **Risks**:
* Triton compiler version mismatches may cause compilation errors
* CUDA toolkit configuration issues may prevent kernel compilation
* Memory constraints on smaller models require batch size adjustments
* **Rollback**: Uninstall packages with `pip uninstall unsloth torch torchvision`.
* DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. With many applications still updating to take advantage of UMA, you may encounter memory issues even when within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```
## Instructions

View File

@ -48,11 +48,13 @@ support for ARM64.
## Time & risk
**Duration:** 30 minutes for Docker approach
**Risks:** Container registry access requires internal credentials
**Rollback:** Container approach is non-destructive.
* **Duration:** 30 minutes for Docker approach
* **Risks:** Container registry access requires internal credentials
* **Rollback:** Container approach is non-destructive.
* DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. With many applications still updating to take advantage of UMA, you may encounter memory issues even when within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```
## Instructions

View File

@ -43,18 +43,20 @@ The setup includes:
## Time & risk
**Duration**:
- 15-20 minutes for initial setup and model downloads
- 30-60 minutes for image VLM training (depending on dataset size)
- 1-2 hours for video VLM training (depending on video dataset size)
**Risks**:
- Docker permission issues may require user group changes and a session restart
- Large model downloads and datasets may require significant disk space and time
- Training requires sustained GPU usage and memory
- Dataset preparation may require manual steps (Kaggle downloads, video processing)
**Rollback**: Stop and remove Docker containers, delete downloaded models and datasets if needed.
* **Duration**:
* 15-20 minutes for initial setup and model downloads
* 30-60 minutes for image VLM training (depending on dataset size)
* 1-2 hours for video VLM training (depending on video dataset size)
* **Risks**:
* Docker permission issues may require user group changes and a session restart
* Large model downloads and datasets may require significant disk space and time
* Training requires sustained GPU usage and memory
* Dataset preparation may require manual steps (Kaggle downloads, video processing)
* **Rollback**: Stop and remove Docker containers, delete downloaded models and datasets if needed.
* DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. With many applications still updating to take advantage of UMA, you may encounter memory issues even when within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```
## Instructions

View File

@ -46,11 +46,13 @@ You will have Visual Studio Code running natively on your DGX Spark device with
## Time & risk
**Duration:** 10-15 minutes
**Risk level:** Low - installation uses official packages with standard rollback
**Rollback:** Standard package removal via system package manager
* **Duration:** 10-15 minutes
* **Risk level:** Low - installation uses official packages with standard rollback
* **Rollback:** Standard package removal via system package manager
* DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. With many applications still updating to take advantage of UMA, you may encounter memory issues even when within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```
## Instructions

View File

@ -45,14 +45,16 @@ You will deploy NVIDIA's VSS AI Blueprint on NVIDIA Spark hardware with Blackwel
## Time & risk
**Duration:** 30-45 minutes for initial setup, additional time for video processing validation
**Risks:**
- Container startup can be resource-intensive and time-consuming with large model downloads
- Network configuration conflicts if shared network already exists
- Remote API endpoints may have rate limits or connectivity issues (hybrid deployment)
**Rollback:** Stop all containers with `docker compose down`, remove shared network with `docker network rm vss-shared-network`, and clean up temporary media directories.
* **Duration:** 30-45 minutes for initial setup, additional time for video processing validation
* **Risks:**
* Container startup can be resource-intensive and time-consuming with large model downloads
* Network configuration conflicts if shared network already exists
* Remote API endpoints may have rate limits or connectivity issues (hybrid deployment)
* **Rollback:** Stop all containers with `docker compose down`, remove shared network with `docker network rm vss-shared-network`, and clean up temporary media directories.
* DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. With many applications still updating to take advantage of UMA, you may encounter memory issues even when within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```
## Instructions