From 231a45230d80ab2c9b6e54f23821e14c5e433df2 Mon Sep 17 00:00:00 2001 From: GitLab CI Date: Wed, 3 Jun 2026 15:15:33 +0000 Subject: [PATCH] chore: Regenerate all playbooks --- README.md | 4 + nvidia/cli-coding-agent/README.md | 474 ++++++ nvidia/cutile-kernels/README.md | 859 ++++++++++ .../assets/fmha_optimization_tutorial.py | 959 +++++++++++ .../assets/fmha_scaling_analysis.py | 891 +++++++++++ nvidia/nemoclaw-applications/README.md | 1412 +++++++++++++++++ nvidia/register-to-brev/README.md | 105 ++ 7 files changed, 4704 insertions(+) create mode 100644 nvidia/cli-coding-agent/README.md create mode 100644 nvidia/cutile-kernels/README.md create mode 100644 nvidia/cutile-kernels/assets/fmha_optimization_tutorial.py create mode 100644 nvidia/cutile-kernels/assets/fmha_scaling_analysis.py create mode 100644 nvidia/nemoclaw-applications/README.md create mode 100644 nvidia/register-to-brev/README.md diff --git a/README.md b/README.md index 6bc2bcf..4c8f17f 100644 --- a/README.md +++ b/README.md @@ -21,11 +21,13 @@ Each playbook includes prerequisites, step-by-step instructions, troubleshooting ### NVIDIA +- [CLI Coding Agent](nvidia/cli-coding-agent/) - [Comfy UI](nvidia/comfy-ui/) - [Connect Three DGX Spark in a Ring Topology](nvidia/connect-three-sparks/) - [Set Up Local Network Access](nvidia/connect-to-your-spark/) - [Connect Two Sparks](nvidia/connect-two-sparks/) - [CUDA-X Data Science](nvidia/cuda-x-data-science/) +- [cuTile Kernels](nvidia/cutile-kernels/) - [DGX Dashboard](nvidia/dgx-dashboard/) - [FLUX.1 Dreambooth LoRA Fine-tuning](nvidia/flux-finetuning/) - [Run Hermes Agent with Local Models](nvidia/hermes-agent/) @@ -41,6 +43,7 @@ Each playbook includes prerequisites, step-by-step instructions, troubleshooting - [NCCL for Two Sparks](nvidia/nccl/) - [Fine-tune with NeMo](nvidia/nemo-fine-tune/) - [Run NemoClaw with a Local LLM](nvidia/nemoclaw/) +- [🦞 Set Up Example NemoClaw Agents 🦞](nvidia/nemoclaw-applications/) - [Nemotron-3-Nano with 
llama.cpp](nvidia/nemotron/) - [NIM on Spark](nvidia/nim-llm/) - [NVFP4 Quantization](nvidia/nvfp4-quantization/) @@ -52,6 +55,7 @@ Each playbook includes prerequisites, step-by-step instructions, troubleshooting - [Fine-tune with Pytorch](nvidia/pytorch-fine-tune/) - [RAG Application in AI Workbench](nvidia/rag-ai-workbench/) - [Spark & Reachy Photo Booth](nvidia/reachy-photo-booth/) +- [Register DGX Spark to Brev](nvidia/register-to-brev/) - [SGLang for Inference](nvidia/sglang/) - [Single-cell RNA Sequencing](nvidia/single-cell/) - [Speculative Decoding](nvidia/speculative-decoding/) diff --git a/nvidia/cli-coding-agent/README.md b/nvidia/cli-coding-agent/README.md new file mode 100644 index 0000000..70720e5 --- /dev/null +++ b/nvidia/cli-coding-agent/README.md @@ -0,0 +1,474 @@ +# CLI Coding Agent + +> Build local CLI coding agents with Ollama + +## Table of Contents + +- [Overview](#overview) +- [Claude Code](#claude-code) +- [OpenCode](#opencode) +- [Codex CLI](#codex-cli) +- [Troubleshooting](#troubleshooting) + +--- + +## Overview + +## Basic idea + +Use [Ollama](https://ollama.com) on [DGX Spark](https://www.nvidia.com/en-us/products/workstations/dgx-spark/) to run a local coding model and connect a CLI coding agent. This +playbook supports three options: **[Claude Code](https://docs.claude.com/en/docs/claude-code)**, **[OpenCode](https://opencode.ai)**, and **[Codex CLI](https://github.com/openai/codex)**. Each +agent is wired up with Ollama's built-in [launch method](https://ollama.com/blog/launch) (`ollama launch `), so you +can work without environment variables, provider config files, or external cloud APIs. + +## Choose your CLI agent + +Pick the tab that matches the CLI agent you want to use: + +- **Claude Code**: Fastest path to a working CLI agent with a local Ollama model. +- **OpenCode**: Open-source CLI launched directly from Ollama. +- **Codex CLI**: OpenAI Codex CLI launched directly from Ollama against the local model. 
+ +## What you'll accomplish + +You will run a local coding model ([Qwen3.6](https://ollama.com/library/qwen3.6)) on your DGX Spark with Ollama, launch your +chosen CLI agent against it with a single command, and complete a small coding task end-to-end. + +## What to know before starting + +- Comfort with Linux command line basics +- Experience running terminal-based tools and editors +- Familiarity with Python for the short coding task + +## Prerequisites + +- DGX Spark access with NVIDIA DGX OS 7.3.1 (Ubuntu 24.04.3 LTS base) +- Internet access to download model weights +- [Ollama](https://ollama.com/download) v0.15 or newer (required for [`ollama launch`](https://ollama.com/blog/launch)) +- GPU memory depends on the Qwen3.6 variant you choose: + - `qwen3.6:latest` (35B-a3b, MoE) - ~24GB, 256K context + - `qwen3.6:35b-a3b-nvfp4` - ~22GB, NVIDIA FP4 build tuned for Blackwell (DGX Spark) + - `qwen3.6:35b-a3b-q8_0` - ~39GB, higher-quality quant + - `qwen3.6:35b-a3b-bf16` - ~71GB, full precision (fits Spark's unified memory) + +## Time & risk + +* **Duration**: ~15-25 minutes (mostly model download time) +* **Risk level**: Low + * Large model downloads can fail if network connectivity is unstable + * Ollama versions older than 0.15 do not support `ollama launch` +* **Rollback**: Stop Ollama and delete the downloaded model from `~/.ollama/models` +* **Last Updated:** 04/16/2026 + * Switched to `ollama launch` method and upgraded the default model to Qwen3.6 + +## Claude Code + +## Step 1. Confirm your environment + +**Description**: Verify the OS version and GPU are visible before installing anything. + +```bash +cat /etc/os-release | head -n 2 +nvidia-smi +``` + +Expected output should show Ubuntu 24.04.3 LTS (DGX OS 7.3.1 base) and a detected GPU. + +## Step 2. Install or update Ollama + +**Description**: Install [Ollama](https://ollama.com/download) or ensure it is recent enough to support [`ollama launch`](https://ollama.com/blog/launch). 
+ +```bash +curl -fsSL https://ollama.com/install.sh | sh +ollama --version +``` + +If Ollama is already installed, just verify the version: + +```bash +ollama --version +``` + +Expected output should show Ollama v0.15 or newer. + +## Step 3. Pull Qwen3.6 + +**Description**: Download the [Qwen3.6](https://ollama.com/library/qwen3.6) model weights to your Spark node. + +```bash +ollama pull qwen3.6 +``` + +Optional variants if you want different memory footprints or precision: + +```bash +ollama pull qwen3.6:35b-a3b-nvfp4 # NVIDIA FP4 build tuned for Blackwell (~22GB) +ollama pull qwen3.6:35b-a3b-q8_0 # Higher-quality 8-bit quant (~39GB) +ollama pull qwen3.6:35b-a3b-bf16 # Full precision (~71GB) +``` + +Expected output should show `qwen3.6` (and any optional variants) in `ollama list`. + +## Step 4. Test local inference (optional) + +**Description**: Run a quick prompt to confirm the model loads. + +```bash +ollama run qwen3.6 +``` + +Try a prompt like: + +```text +Write a short README checklist for a Python project. +``` + +Expected output should show the model responding in the terminal. When you are done, type `/bye` or press `Ctrl+D` to exit the interactive session before continuing. + +## Step 5. Launch Claude Code with Ollama + +**Description**: Use Ollama's built-in [launch method](https://ollama.com/blog/launch) to start [Claude Code](https://docs.claude.com/en/docs/claude-code) against your local model. No environment variables or config files are required. + +```bash +ollama launch claude +``` + +Expected output should show Claude Code starting and using the local Qwen3.6 model. Qwen3.6 ships with a 256K context window by default; adjust context length through Ollama's settings if you need to tune it further. + +## Step 6. Complete a small coding task + +**Description**: Create a tiny repo and let Claude Code implement a function and tests. 
+ +```bash +mkdir -p ~/cli-agent-demo +cd ~/cli-agent-demo + +printf 'def add(a, b):\n """Return the sum of a and b."""\n pass\n' > math_utils.py +printf 'import math_utils\n\n\ndef test_add():\n assert math_utils.add(1, 2) == 3\n' > test_math_utils.py +``` + +If you do not already have pytest installed: + +```bash +python3 -m pip install -U pytest +``` + +In Claude Code: + +```text +Please implement add() in math_utils.py and make sure the test passes. +``` + +Run the test: + +```bash +python3 -m pytest -q +``` + +Expected output should show the test passing. + +## Step 7. Cleanup and rollback + +**Description**: Remove the model and stop services if you no longer need them. + +To stop the service: + +```bash +sudo systemctl stop ollama +``` + +> [!WARNING] +> This will delete the downloaded model files. + +```bash +ollama rm qwen3.6 +``` + +## Step 8. Next steps + +- Try the `qwen3.6:35b-a3b-nvfp4` or `bf16` variants for different quality/VRAM tradeoffs +- Use Claude Code on multi-file refactors or test-generation tasks +- Explore the full 256K context window on larger codebases + +## OpenCode + +## Step 1. Confirm your environment + +**Description**: Verify the OS version and GPU are visible before installing anything. + +```bash +cat /etc/os-release | head -n 2 +nvidia-smi +``` + +Expected output should show Ubuntu 24.04.3 LTS (DGX OS 7.3.1 base) and a detected GPU. + +## Step 2. Install or update Ollama + +**Description**: Install [Ollama](https://ollama.com/download) or ensure it is recent enough to support [`ollama launch`](https://ollama.com/blog/launch). + +```bash +curl -fsSL https://ollama.com/install.sh | sh +ollama --version +``` + +If Ollama is already installed, just verify the version: + +```bash +ollama --version +``` + +Expected output should show Ollama v0.15 or newer. + +## Step 3. Pull Qwen3.6 + +**Description**: Download the [Qwen3.6](https://ollama.com/library/qwen3.6) model weights to your Spark node. 
+ +```bash +ollama pull qwen3.6 +``` + +Optional variants if you want different memory footprints or precision: + +```bash +ollama pull qwen3.6:35b-a3b-nvfp4 # NVIDIA FP4 build tuned for Blackwell (~22GB) +ollama pull qwen3.6:35b-a3b-q8_0 # Higher-quality 8-bit quant (~39GB) +ollama pull qwen3.6:35b-a3b-bf16 # Full precision (~71GB) +``` + +Expected output should show `qwen3.6` in `ollama list`. + +## Step 4. Test local inference (optional) + +**Description**: Run a quick prompt to confirm the model loads. + +```bash +ollama run qwen3.6 +``` + +Try a prompt like: + +```text +Write a short README checklist for a Python project. +``` + +Expected output should show the model responding. When you are done, type `/bye` or press `Ctrl+D` to exit before continuing. + +## Step 5. Launch OpenCode with Ollama + +**Description**: Use Ollama's built-in [launch method](https://ollama.com/blog/launch) to start [OpenCode](https://opencode.ai) against your local model. No [`opencode.json`](https://opencode.ai/docs/config/) provider configuration is required. + +```bash +ollama launch opencode +``` + +If you want to pre-configure OpenCode without launching immediately: + +```bash +ollama launch opencode --config +``` + +Expected output should show OpenCode starting with Ollama preselected as the provider and Qwen3.6 as the model. Qwen3.6 ships with a 256K context window by default. + +## Step 6. Complete a small coding task + +**Description**: Create a tiny repo and let OpenCode implement a function and tests. + +```bash +mkdir -p ~/cli-agent-demo +cd ~/cli-agent-demo + +printf 'def add(a, b):\n """Return the sum of a and b."""\n pass\n' > math_utils.py +printf 'import math_utils\n\n\ndef test_add():\n assert math_utils.add(1, 2) == 3\n' > test_math_utils.py +``` + +If you do not already have pytest installed: + +```bash +python3 -m pip install -U pytest +``` + +In OpenCode: + +```text +Please implement add() in math_utils.py and make sure the test passes. 
+``` + +Run the test: + +```bash +python3 -m pytest -q +``` + +Expected output should show the test passing. + +## Step 7. Cleanup and rollback + +**Description**: Remove the model and stop services if you no longer need them. + +To stop the service: + +```bash +sudo systemctl stop ollama +``` + +> [!WARNING] +> This will delete the downloaded model files. + +```bash +ollama rm qwen3.6 +``` + +## Step 8. Next steps + +- Try the `qwen3.6:35b-a3b-nvfp4` or `bf16` variants for different quality/VRAM tradeoffs +- Use OpenCode on multi-file changes or test-generation tasks +- Explore the full 256K context window on larger codebases + +## Codex CLI + +## Step 1. Confirm your environment + +**Description**: Verify the OS version and GPU are visible before installing anything. + +```bash +cat /etc/os-release | head -n 2 +nvidia-smi +``` + +Expected output should show Ubuntu 24.04.3 LTS (DGX OS 7.3.1 base) and a detected GPU. + +## Step 2. Install or update Ollama + +**Description**: Install [Ollama](https://ollama.com/download) or ensure it is recent enough to support [`ollama launch`](https://ollama.com/blog/launch). + +```bash +curl -fsSL https://ollama.com/install.sh | sh +ollama --version +``` + +If Ollama is already installed, just verify the version: + +```bash +ollama --version +``` + +Expected output should show Ollama v0.15 or newer. + +## Step 3. Pull Qwen3.6 + +**Description**: Download the [Qwen3.6](https://ollama.com/library/qwen3.6) model weights to your Spark node. + +```bash +ollama pull qwen3.6 +``` + +Optional variants if you want different memory footprints or precision: + +```bash +ollama pull qwen3.6:35b-a3b-nvfp4 # NVIDIA FP4 build tuned for Blackwell (~22GB) +ollama pull qwen3.6:35b-a3b-q8_0 # Higher-quality 8-bit quant (~39GB) +ollama pull qwen3.6:35b-a3b-bf16 # Full precision (~71GB) +``` + +Expected output should show `qwen3.6` in `ollama list`. + +## Step 4. 
Test local inference (optional) + +**Description**: Run a quick prompt to confirm the model loads. + +```bash +ollama run qwen3.6 +``` + +Try a prompt like: + +```text +Write a short README checklist for a Python project. +``` + +Expected output should show the model responding. When you are done, type `/bye` or press `Ctrl+D` to exit before continuing. + +## Step 5. Launch Codex CLI with Ollama + +**Description**: Use Ollama's built-in [launch method](https://ollama.com/blog/launch) to start [Codex CLI](https://github.com/openai/codex) against your local model. No `~/.codex/config.toml` and no manual `npm install -g @openai/codex` are required; Ollama handles the Codex integration. + +```bash +ollama launch codex +``` + +Expected output should show Codex CLI starting with Ollama as the provider and Qwen3.6 as the model. Qwen3.6 ships with a 256K context window by default, which is well suited to Codex's agentic workflows. + +## Step 6. Complete a small coding task + +**Description**: Create a tiny repo and let Codex implement a function and tests. + +```bash +mkdir -p ~/cli-agent-demo +cd ~/cli-agent-demo + +printf 'def add(a, b):\n """Return the sum of a and b."""\n pass\n' > math_utils.py +printf 'import math_utils\n\n\ndef test_add():\n assert math_utils.add(1, 2) == 3\n' > test_math_utils.py +``` + +If you do not already have pytest installed: + +```bash +python3 -m pip install -U pytest +``` + +In Codex: + +```text +Please implement add() in math_utils.py and make sure the test passes. +``` + +Run the test: + +```bash +python3 -m pytest -q +``` + +Expected output should show the test passing. + +## Step 7. Cleanup and rollback + +**Description**: Remove the model and stop services if you no longer need them. + +To stop the service: + +```bash +sudo systemctl stop ollama +``` + +> [!WARNING] +> This will delete the downloaded model files. + +```bash +ollama rm qwen3.6 +``` + +## Step 8. 
Next steps + +- Try the `qwen3.6:35b-a3b-nvfp4` or `bf16` variants for different quality/VRAM tradeoffs +- Use Codex CLI on multi-file changes or test-generation tasks +- Explore the full 256K context window on larger codebases + +## Troubleshooting + +| Symptom | Cause | Fix | +|---------|-------|-----| +| `ollama: command not found` | Ollama not installed or PATH not updated | Rerun `curl -fsSL https://ollama.com/install.sh \| sh` and open a new shell | +| `ollama launch` reports unknown command | Ollama is older than v0.15 | Update Ollama: `curl -fsSL https://ollama.com/install.sh \| sh` | +| Model load fails with version error or HTTP 412 | Ollama version is too old for the model | Update Ollama: `curl -fsSL https://ollama.com/install.sh \| sh` | +| `model not found` when launching an agent | Model was not pulled | Run `ollama pull qwen3.6` and retry | +| `connection refused` to localhost:11434 | Ollama service not running | Start with `ollama serve` or `sudo systemctl start ollama` | +| `ollama launch ` exits immediately | Agent integration failed to initialize | Re-run `ollama launch `; if it persists, check `journalctl -u ollama` | +| Slow responses or OOM errors | Model variant too large for GPU memory | Switch to `qwen3.6:35b-a3b-nvfp4` or close other GPU workloads | + +> [!NOTE] +> DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing +> between the GPU and CPU. 
If you see memory pressure, flush the buffer cache with: +> ```bash +> sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches' +> ``` diff --git a/nvidia/cutile-kernels/README.md b/nvidia/cutile-kernels/README.md new file mode 100644 index 0000000..dfeeda0 --- /dev/null +++ b/nvidia/cutile-kernels/README.md @@ -0,0 +1,859 @@ +# cuTile Kernels + +> Run cuTile kernel benchmarks, FMHA implementation, and LLM inference on DGX Spark and B300 + + +## Table of Contents + +- [Overview](#overview) +- [Kernel Benchmarks](#kernel-benchmarks) +- [End-to-End Inference](#end-to-end-inference) +- [FMHA Implementation](#fmha-implementation) + - [Attention Basics](#attention-basics) + - [Flash Attention Algorithm](#flash-attention-algorithm) + - [cuTile Pseudocode → Actual Mapping](#cutile-pseudocode-actual-mapping) + - [Kernel Pseudocode](#kernel-pseudocode) + - [cuTile Implementation](#cutile-implementation) + - [Launching the Kernel](#launching-the-kernel) + - [Optimizations](#optimizations) + - [Platform Configuration](#platform-configuration) + - [Performance Results](#performance-results) + - [Common Issues](#common-issues) + - [Companion Scripts](#companion-scripts) + - [References](#references) +- [Platform Comparison](#platform-comparison) + - [End-to-End Throughput](#end-to-end-throughput) + - [CUDA Kernel Time](#cuda-kernel-time) + - [cuTile Kernel Breakdown](#cutile-kernel-breakdown) +- [Troubleshooting](#troubleshooting) + +--- + +## Overview + +## Basic idea + +[TileGym](https://github.com/NVIDIA/TileGym) is NVIDIA's benchmark suite and integration framework for cuTile kernels - high-performance GPU kernels written using the cuTile Python DSL. cuTile compiles to Tile IR, enabling developers to write efficient kernels without low-level CUDA programming. + +This playbook covers three workflows: +1. **[Kernel Benchmarks](kernel-benchmarks)** - Run standalone cuTile kernel benchmarks (FMHA, MatMul, RMSNorm, etc.) +2. 
**[End-to-End Inference](e2e-inference)** - Run LLM inference with cuTile-optimized kernels via monkey-patching +3. **[FMHA Implementation](fmha)** - Step-by-step tutorial building a Flash Multi-Head Attention kernel from pseudocode to optimized cuTile, with companion scripts to run and benchmark + +The same cuTile code runs on both DGX Spark (sm_121) and B300 (sm_103) - cuTile JIT compiles to the appropriate GPU architecture automatically. + +## What you'll accomplish + +- Run the TileGym benchmark suite on DGX Spark +- Run Qwen2-7B or DeepSeek-V2-Lite inference with cuTile-optimized kernels +- Observe performance scaling between DGX Spark and B300 +- Build an FMHA kernel step-by-step from pseudocode to optimized cuTile implementation + +## What to know before starting + +- Basic familiarity with Docker and command-line tools +- Understanding of GPU compute concepts (TFLOPS, memory bandwidth) +- No CUDA programming experience required +- HuggingFace account with access token (for LLM inference) + +## Prerequisites + +**Hardware Requirements:** +- DGX Spark with Ubuntu 24.04 or B300 cloud instance +- Minimum 16GB GPU memory for LLM inference +- At least 50GB available storage space for model downloads + +**Software Requirements:** +- Docker installed and configured: `docker ps` +- CUDA Toolkit 13.x with Tile IR support +- HuggingFace token for model access (LLM inference only) +- Network access for pulling containers and downloading models + +Verify Docker is available: +```bash +docker ps +``` + +If you get a permission error: +```bash +sudo usermod -aG docker $USER +newgrp docker +``` + +## Kernel support matrix + +| Kernel | Category | Data Types | Description | +|--------|----------|------------|-------------| +| **FMHA** | Attention | float16, float8 | Flash Multi-Head Attention | +| **MLA** | Attention | bfloat16, float8 | Multi-head Latent Attention | +| **MLA Decoding** | Attention | float16, float8 | MLA for decode phase | +| **MatMul** | Matrix Ops | 
float16, float8 | Matrix multiplication | +| **BMM** | Matrix Ops | float16 | Batched matrix multiplication | +| **Group GEMM** | Matrix Ops | float16, float8 | Grouped GEMM for MoE | +| **RMSNorm** | Normalization | float16, bfloat16 | Root mean square normalization | +| **RoPE** | Positional | float16 | Rotary position embedding | +| **SiLU** | Activation | float16, float32 | SiLU activation with multiply | +| **SwiGLU** | Activation | float16, float32 | SwiGLU fused operation | +| **Softmax** | Activation | float16 | Softmax normalization | +| **Dropout** | Regularization | float16, float32 | Dropout forward | + +## Model support for LLM inference + +| Model | Supported Kernels | Batch Size | Output Tokens | Notes | +|-------|-------------------|------------|---------------|-------| +| **Qwen2-7B** | RoPE, RMSNorm, SwiGLU, FMHA | 16 | 50 | Standard transformer | +| **DeepSeek-V2-Lite** | RoPE, RMSNorm, SiLU, MLA, MoE | 1 | 100 | MLA attention, MoE layers | + +## Ancillary files + +All required assets can be found in the [TileGym repository](https://github.com/NVIDIA/TileGym). 
+ +- `tests/benchmark/run_all.sh` - Run all kernel benchmarks +- `modeling/transformers/bench_qwen.sh` - Qwen2-7B benchmark script +- `modeling/transformers/bench_deepseek.sh` - DeepSeek-V2-Lite benchmark script +- `modeling/transformers/infer.py` - Main inference script with TileGym integration +- [`assets/fmha_optimization_tutorial.py`](assets/fmha_optimization_tutorial.py) - FMHA step-by-step optimization tutorial +- [`assets/fmha_scaling_analysis.py`](assets/fmha_scaling_analysis.py) - FMHA scaling analysis across sequence lengths + +## Time & risk + +* **Estimated time:** 30-45 minutes (including model download for LLM inference) +* **Risk level:** Low + * Large downloads may fail due to network issues + * First run includes JIT compilation overhead +* **Rollback:** Remove Docker container to undo all changes +* **Last Updated:** February 2026 + * First Publication + +## Kernel Benchmarks + +## Step 1. Pull CUDA NGC container with CTK 13.x + +```bash +docker pull nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04 +``` + +Launch an interactive session with GPU access: + +```bash +docker run --gpus all -it --rm \ + -v ~/TileGym:/workspace/TileGym \ + nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04 \ + /bin/bash +``` + +> [!NOTE] +> The `-v` flag mounts a local directory to persist the TileGym repository. The `--rm` flag automatically removes the container when you exit; omit it if you want to keep the container for later use. + +Or if running outside a container, install Tile IR directly: + +```bash +## Requires root privileges - run with sudo or as root +sudo apt-get install cuda-tile-ir-13-1 cuda-compiler-13-1 +``` + +## Step 2. Clone TileGym repository + +```bash +git clone https://github.com/NVIDIA/TileGym +cd TileGym +pip install . +``` + +## Step 3. Run benchmark suite + +```bash +cd tests/benchmark/ +bash run_all.sh +``` + +> [!NOTE] +> The benchmark runs sequentially to ensure accurate timing results. This may take 10-15 minutes to complete all kernels. + +## Step 4. 
View results + +Results show cuTile performance for each kernel and sequence length. + +Expected output should look like: + +```text +========================================== +Running bench_fused_attention.py... +========================================== +fused-attention-batch4-head32-d128-fwd-causal=True-float16-TFLOPS: + N_CTX CuTile +0 1024.0 58.188262 +1 2048.0 80.906892 +2 4096.0 86.189532 +3 8192.0 88.891086 +4 16384.0 89.491869 +✓ PASSED: bench_fused_attention.py +``` + +## Step 5. Run individual benchmarks + +To run specific kernel benchmarks: + +```bash +## Flash Multi-Head Attention +python bench_fused_attention.py + +## Matrix Multiplication +python bench_matrix_multiplication.py + +## RMSNorm +python bench_rmsnorm.py + +## RoPE +python bench_rope.py + +## SwiGLU +python bench_swiglu.py +``` + +## Step 6. Clean up + +Exit the container: + +```bash +exit +``` + +Remove this workflow's containers (if you ran without `--rm`): + +```bash +## Preferred: remove only containers from this workflow's image +docker rm $(docker ps -a --filter ancestor=nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04 --format '{{.ID}}') + +## Alternative: prune all stopped containers (will prompt for confirmation) +## docker container prune +``` + +Remove the image (optional): + +```bash +docker rmi nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04 +``` + +## Step 7. Repeat on B300 + +Repeat Steps 1-6 on B300 hardware to observe scaling. See the **Platform Comparison** tab for expected scaling results. + +## End-to-End Inference + +## Step 1. Set up environment + +If you haven't already, pull the CUDA container and clone TileGym (see **Kernel Benchmarks** tab for details). 
+ +First, clone TileGym on the host: + +```bash +mkdir -p ~/TileGym +git clone https://github.com/NVIDIA/TileGym ~/TileGym +``` + +Then launch the container with the repository mounted: + +```bash +docker run --gpus all -it --rm \ + -v ~/TileGym:/workspace/TileGym \ + -v ~/.cache/huggingface:/root/.cache/huggingface \ + nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04 \ + /bin/bash +``` + +> [!NOTE] +> The `-v ~/.cache/huggingface:/root/.cache/huggingface` mounts your HuggingFace cache to avoid re-downloading models. + +Install TileGym inside the container: + +```bash +cd /workspace/TileGym +pip install . +``` + +Set your HuggingFace token for accessing gated models: + +```bash +export HF_TOKEN= +``` + +> [!WARNING] +> You need a HuggingFace account and access token. Get one at https://huggingface.co/settings/tokens + +## Step 2. Run inference benchmark + +Navigate to the transformers benchmark directory: + +```bash +cd modeling/transformers +``` + +**Option A: Run Qwen2-7B benchmark** + +```bash +./bench_qwen.sh +``` + +Configuration: Model `Qwen/Qwen2-7B`, Batch size 16, Output length 50 tokens. + +**Option B: Run DeepSeek-V2-Lite benchmark** + +```bash +./bench_deepseek.sh +``` + +Configuration: Model `deepseek-ai/DeepSeek-V2-Lite-Chat`, Batch size 1, Output length 100 tokens. + +Both scripts run two configurations: +1. **PyTorch baseline** - Standard HuggingFace inference +2. **TileGym cuTile** - With cuTile kernel replacements + +## Step 3. 
View results + +**Sample DGX Spark (GB10) Results for Qwen2-7B:** + +```text +======================================== + Benchmark Results +======================================== +Qwen2-7B_naive_bfloat16 | 15.66 tokens/s | 51.10s | 51151.0ms CUDA +Qwen2-7B_cutile_attn | 18.52 tokens/s | 43.20s | 43079.7ms CUDA +======================================== +``` + +**cuTile Kernel Breakdown (DGX Spark - Qwen2):** + +| Kernel | CUDA Time (ms) | Calls | +|--------|----------------|-------| +| `fmha_kernel` | 4185.9 | 28 | +| `swiglu_forward_kernel` | 2459.8 | 1400 | +| `attention_decode_kernel_grouped` | 2271.8 | 1372 | +| `rms_norm_kernel_static_persistent` | 634.7 | 57 | +| `rope_kernel` | 355.6 | 1400 | + +## Step 4. How TileGym monkey-patching works + +TileGym replaces PyTorch model operations with cuTile kernels. The snippet below is taken from TileGym's [`src/tilegym/transformers/monkey_patch.py`](https://github.com/NVIDIA/TileGym/blob/main/src/tilegym/transformers/monkey_patch.py) and invoked from [`modeling/transformers/infer.py`](https://github.com/NVIDIA/TileGym/blob/main/modeling/transformers/infer.py): + +```python +from tilegym.transformers import apply_tilegym_kernel_to_qwen2 + +apply_tilegym_kernel_to_qwen2( + rope=True, # Replace RoPE with cuTile kernel + rms_norm=True, # Replace RMSNorm with cuTile kernel + swiglu=True, # Replace SwiGLU with cuTile kernel + attn=True, # Replace attention with cuTile FMHA + use_cutile=True # Use cuTile backend (vs Triton) +) +``` + +**Patched Kernels for Qwen2:** + +| Kernel | PyTorch Operation | cuTile Replacement | +|--------|-------------------|-------------------| +| `rms_norm_kernel_static_persistent` | `nn.RMSNorm` | Persistent RMSNorm | +| `rope_kernel` | Rotary position embedding | Fused RoPE | +| `fmha_kernel` | `F.scaled_dot_product_attention` | Flash Attention | +| `swiglu_forward_kernel` | SiLU + Mul | Fused SwiGLU | +| `attention_decode_kernel_grouped` | Decode attention | Grouped decode | + +**Patched 
Kernels for DeepSeek-V2:** (see [`src/tilegym/transformers/monkey_patch.py`](https://github.com/NVIDIA/TileGym/blob/main/src/tilegym/transformers/monkey_patch.py)) + +```python +from tilegym.transformers import apply_tilegym_kernel_to_deepseek_v2 + +apply_tilegym_kernel_to_deepseek_v2( + rope=True, # Replace RoPE with cuTile kernel + rms_norm=True, # Replace RMSNorm with cuTile kernel + swiglu=True, # Replace SiLU+Mul with cuTile kernel + attn=True, # Replace MLA attention with cuTile + moe=True, # Replace MoE routing with cuTile + use_cutile=True +) +``` + +| Kernel | PyTorch Operation | cuTile Replacement | +|--------|-------------------|-------------------| +| `prefill_mla` | MLA prefill attention | Multi-head Latent Attention | +| `_mla_decoding_split_kv` | MLA decode attention | Split-KV decoding | +| `fused_moe_kernel` | MoE expert routing | Fused MoE | +| `group_gemm_kernel` | Expert FFN | Grouped GEMM | + +## Step 5. Platform-specific tuning (Advanced) + +cuTile exposes two complementary performance-tuning mechanisms, `ct.ByTarget` and `ct.autotune`, applied here to the `num_ctas` and `occupancy` kernel parameters: + +- **[`ct.ByTarget`](https://docs.nvidia.com/cuda/cutile-python/performance.html)** - Select different kernel launch parameters per GPU architecture (`sm_`). The compiler picks the value matching the current target at JIT time; if no entry matches, the `default` value is used. See the [Performance Tuning](https://docs.nvidia.com/cuda/cutile-python/performance.html) and [Execution Model](https://docs.nvidia.com/cuda/cutile-python/execution.html) pages. +- **`num_ctas`** - Number of Cooperative Thread Arrays (thread blocks) launched per kernel invocation. Tune to the number of SMs on the target GPU. +- **`occupancy`** - Hint for the number of concurrent CTAs the compiler should target per SM. Higher occupancy hides memory latency but increases register/shared-memory pressure. 
See the [Execution Model](https://docs.nvidia.com/cuda/cutile-python/execution.html) documentation. 
+- **[`ct.autotune`](https://docs.nvidia.com/cuda/cutile-python/performance.html)** - Search a list of candidate values at runtime and pick the fastest configuration. Results are reported via [`cuda.tile.tune.TuningResult`](https://docs.nvidia.com/cuda/cutile-python/generated/cuda.tile.tune.TuningResult.html) / [`Measurement`](https://docs.nvidia.com/cuda/cutile-python/generated/cuda.tile.tune.Measurement.html). + +```python +import cuda.tile as ct + +@ct.kernel( +# # num_ctas: how many thread blocks to launch. +# # Use ByTarget to pick an arch-specific value at JIT time. + num_ctas=ct.ByTarget({ + "sm_103": 8, # B300 - more SMs, launch more CTAs + "sm_121": 4, # DGX Spark - fewer SMs (48), use fewer CTAs + "default": 1, # Fallback for any other GPU architecture + }), +# # occupancy: hint for concurrent CTAs per SM (latency hiding vs. register pressure). + occupancy=ct.ByTarget({ + "sm_103": 16, # B300 - high occupancy, plenty of registers/SMEM + "sm_121": 12, # DGX Spark - moderate occupancy + "default": 8, # Conservative fallback + }), + opt_level=3 # Maximum compiler optimization level +) +def optimized_kernel(A, B, C): +# # Same kernel code works on all platforms; +# # ByTarget swaps in the arch-specific launch params automatically. + ... +``` + +For automatic tuning, use [`ct.autotune`](https://docs.nvidia.com/cuda/cutile-python/performance.html) to search over candidate values and pick the fastest configuration at runtime: + +```python +@ct.kernel( +# # autotune: benchmark each value and pick the fastest. + num_ctas=ct.autotune([1, 2, 4, 8, 16]), + occupancy=ct.autotune([8, 12, 16, 24]), + opt_level=3 +) +def autotuned_kernel(A, B, C): + ... +``` + +## Step 6. Repeat on B300 + +Repeat Steps 1-3 on B300 hardware. The **same code runs without modification** - cuTile JIT compiles for sm_103 automatically. + +See the **Platform Comparison** tab for detailed scaling results. 
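
When comparing runs across platforms, it can help to reduce the result lines to a single speedup figure. The sketch below is a hypothetical helper, not part of TileGym: it assumes the `name | tokens/s | seconds | ms CUDA` line layout shown in the Step 3 sample output, and the names `parse_results` and `speedup` are illustrative only.

```python
import re

# Sample result lines copied from the Step 3 output (DGX Spark, Qwen2-7B).
SAMPLE = """\
Qwen2-7B_naive_bfloat16 | 15.66 tokens/s | 51.10s | 51151.0ms CUDA
Qwen2-7B_cutile_attn | 18.52 tokens/s | 43.20s | 43079.7ms CUDA
"""

# Assumed layout: <name> | <float> tokens/s | <float>s | <float>ms CUDA
LINE_RE = re.compile(
    r"^\s*(?P<name>\S+)\s*\|\s*(?P<tps>\d+(?:\.\d+)?)\s*tokens/s"
    r"\s*\|\s*(?P<wall>\d+(?:\.\d+)?)s"
    r"\s*\|\s*(?P<cuda>\d+(?:\.\d+)?)ms\s+CUDA\s*$"
)


def parse_results(text):
    """Map each run name to its throughput, wall time, and CUDA time."""
    runs = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:  # separator and header lines simply do not match
            runs[match["name"]] = {
                "tokens_per_s": float(match["tps"]),
                "wall_s": float(match["wall"]),
                "cuda_ms": float(match["cuda"]),
            }
    return runs


def speedup(runs, baseline, candidate):
    """Throughput ratio of candidate over baseline (greater than 1 is faster)."""
    return runs[candidate]["tokens_per_s"] / runs[baseline]["tokens_per_s"]


runs = parse_results(SAMPLE)
ratio = speedup(runs, "Qwen2-7B_naive_bfloat16", "Qwen2-7B_cutile_attn")
print(f"cuTile vs. baseline: {ratio:.2f}x tokens/s")  # 1.18x for the sample above
```

Feeding the B300 output through the same helper gives a like-for-like ratio, provided the benchmark scripts print results in the same layout on both platforms.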
+ +## FMHA Implementation + +## FMHA Implementation Guide + +> [!NOTE] +> This is a guide to understanding the FMHA implementation in cuTile, not a complete reference. For comprehensive documentation, see the [cuTile Python Documentation](https://docs.nvidia.com/cuda/cutile-python/). + +### Attention Basics + +Attention allows a neural network to focus on relevant parts of the input. In transformers (GPT, LLaMA, Qwen), each position computes how much to attend to every other position using three vectors: + +- **Query (Q)**: "What am I looking for?" +- **Key (K)**: "What do I contain?" +- **Value (V)**: "Here is my content" + +```text +Attention(Q, K, V) = softmax(Q × K^T / √d) × V + +Shapes: + Q, K, V = [batch, heads, seq_len, head_dim] + Q × K^T = [batch, heads, seq_len, seq_len] # Attention scores + Output = [batch, heads, seq_len, head_dim] +``` + +For autoregressive models, **causal masking** ensures each token attends only to itself and earlier tokens by setting future scores to -infinity before softmax. + +### Flash Attention Algorithm + +Standard attention materializes a [seq_len × seq_len] score matrix per batch and head (e.g., 2 GB at 2 bytes per element for seq_len=32768).
Flash Attention avoids this by processing K and V in tiles with **online softmax**: + +```text +m = -infinity # Running maximum +l = 0 # Running sum of exp(x - m) +acc = 0 # Running weighted sum of values + +FOR each K,V tile: + scores = Q_tile @ K_tile.T * scale + m_new = max(m, max(scores)) + correction = exp(m - m_new) + l = l * correction + sum(exp(scores - m_new)) + acc = acc * correction + exp(scores - m_new) @ V_tile + m = m_new + +output = acc / l +``` + +### cuTile Pseudocode → Actual Mapping + +| Concept | Pseudocode | cuTile | +|---|---|---| +| Define kernel | `KERNEL fmha(...)` | `@ct.kernel()` | +| Get block ID | `block_x = BLOCK_ID_X` | `bid_x = ct.bid(0)` | +| Create indices | `range(0, N)` | `ct.arange(N, dtype=ct.int32)` | +| Create constant tile | `tile = zeros(M, N)` | `ct.full((M, N), 0.0, dtype)` | +| Load from memory | `tile = LOAD(ptr, shape)` | `ct.load(tensor, index, shape)` | +| Store to memory | `STORE(ptr, tile)` | `ct.store(tensor, index, tile)` | +| Matrix multiply | `C = A @ B + C` | `ct.mma(A, B, C)` | +| Reduction | `max_val = MAX(tile, axis)` | `ct.max(tile, axis, keepdims)` | + +### Kernel Pseudocode + +```text +KERNEL fmha(Q, K, V, Out, scale, TILE_M, TILE_N): + tile_row = BLOCK_ID_X + batch_head = BLOCK_ID_Y + batch = batch_head // num_heads + head = batch_head % num_heads + + m_i = full(TILE_M, -infinity) + l_i = full(TILE_M, 0) + acc = zeros(TILE_M, head_dim) + + q = LOAD(Q[batch, head, tile_row*TILE_M : (tile_row+1)*TILE_M, :]) + + FOR j = 0 to num_k_tiles - 1: + k = LOAD(K[batch, head, j*TILE_N : (j+1)*TILE_N, :]) + v = LOAD(V[batch, head, j*TILE_N : (j+1)*TILE_N, :]) + scores = MMA(q, transpose(k)) * scale + IF causal AND in_mask_region: + scores = WHERE(valid_mask, scores, -infinity) + m_new = max(m_i, row_max(scores)) + correction = exp(m_i - m_new) + p = exp(scores - m_new) + l_i = l_i * correction + row_sum(p) + acc = acc * correction + MMA(p, v) + m_i = m_new + + out = acc / l_i + STORE(Out[batch, head, tile_row*TILE_M : (tile_row+1)*TILE_M, :],
out) +``` + +### cuTile Implementation + +```python +import cuda.tile as ct +import math +ConstInt = ct.Constant[int] +ConstBool = ct.Constant[bool] + +@ct.kernel() +def fmha_kernel(Q, K, V, Out, qk_scale: float, TILE_D: ConstInt, H: ConstInt, + TILE_M: ConstInt, TILE_N: ConstInt, CAUSAL: ConstBool): + bid_x, bid_y = ct.bid(0), ct.bid(1) + batch_idx, head_idx = bid_y // H, bid_y % H + + offs_m = (bid_x * TILE_M + ct.arange(TILE_M, dtype=ct.int32))[:, None] + offs_n_tile = ct.arange(TILE_N, dtype=ct.int32)[None, :] + + m_i = ct.full((TILE_M, 1), -math.inf, dtype=ct.float32) + l_i = ct.full((TILE_M, 1), 0.0, dtype=ct.float32) + acc = ct.full((TILE_M, TILE_D), 0.0, dtype=ct.float32) + + q = ct.load(Q, index=(batch_idx, head_idx, bid_x, 0), + shape=(1, 1, TILE_M, TILE_D)).reshape((TILE_M, TILE_D)) + + k_seqlen = K.shape[2] + if CAUSAL: + Tc = ct.cdiv(min((bid_x + 1) * TILE_M, k_seqlen), TILE_N) + mask_start = (bid_x * TILE_M) // TILE_N + else: + Tc = ct.cdiv(k_seqlen, TILE_N) + mask_start = k_seqlen // TILE_N + + for j in range(0, Tc): + k_tile = ct.load(K, index=(batch_idx, head_idx, j, 0), + shape=(1, 1, TILE_N, TILE_D)).reshape((TILE_N, TILE_D)) + k_t = ct.permute(k_tile, (1, 0)) + + qk = ct.mma(q, k_t, ct.full((TILE_M, TILE_N), 0.0, dtype=ct.float32)) + qk = qk * qk_scale + + if CAUSAL and j >= mask_start: + offs_n = j * TILE_N + offs_n_tile + qk = ct.where(offs_m >= offs_n, qk, + ct.full((TILE_M, TILE_N), -math.inf, dtype=ct.float32)) + + m_ij = ct.maximum(m_i, ct.max(qk, axis=-1, keepdims=True)) + qk = qk - m_ij + p = ct.exp(qk) + alpha = ct.exp(m_i - m_ij) + l_i = l_i * alpha + ct.sum(p, axis=-1, keepdims=True) + acc = acc * alpha + + v_tile = ct.load(V, index=(batch_idx, head_idx, j, 0), + shape=(1, 1, TILE_N, TILE_D)).reshape((TILE_N, TILE_D)) + acc = ct.mma(p.astype(Q.dtype), v_tile, acc) + m_i = m_ij + + acc = (acc / l_i).reshape((1, 1, TILE_M, TILE_D)).astype(Out.dtype) + ct.store(Out, index=(batch_idx, head_idx, bid_x, 0), tile=acc) +``` + +### Launching 
the Kernel + +```python +def run_fmha(q, k, v, sm_scale, is_causal=True): + import torch + TILE_M, TILE_N = 64, 64 # Platform-specific (see below) + batch, num_heads, seq_len, head_dim = q.shape + out = torch.empty_like(q) + grid = (math.ceil(seq_len / TILE_M), batch * num_heads, 1) + ct.launch( + torch.cuda.current_stream(), grid, fmha_kernel, + (q, k, v, out, sm_scale, head_dim, num_heads, TILE_M, TILE_N, is_causal) + ) + return out +``` + +### Optimizations + +#### exp2 + flush_to_zero + +`exp2(x) = 2^x` is faster than `exp(x)` on GPU. It requires rescaling the softmax scale by `1/log(2)`. + +```python +## Convert natural-exp scale to base-2 so we can use the faster ct.exp2 intrinsic. +## exp(x) == exp2(x / log(2)) == exp2(x * INV_LOG_2). +INV_LOG_2 = 1.0 / math.log(2) # ≈ 1.4427 +qk_scale_log2 = qk_scale * INV_LOG_2 # Pre-multiply the softmax scale once + +## ... in loop: +## Row-max of this tile, scaled into the base-2 domain. +m_ij = ct.max(qk, axis=-1, keepdims=True) * qk_scale_log2 +## Fold it into the running max carried across tiles (online softmax). +m_ij = ct.maximum(m_i, m_ij) +## Subtract the running max for numerical stability. +qk = qk * qk_scale_log2 - m_ij +## flush_to_zero=True: flush denormals to 0 -> avoids slow denormal handling on GPU. +p = ct.exp2(qk, flush_to_zero=True) +alpha = ct.exp2(m_i - m_ij, flush_to_zero=True) # Correction factor for previous acc/l_i +``` + +#### Load Order Transpose + +Load K already transposed using the `order` parameter, avoiding an explicit permute. + +```python +## order=(0,1,3,2) swaps the last two axes during the load, +## producing K^T directly in registers -- no extra ct.permute() needed. +## shape is expressed in the transposed layout: (1, 1, TILE_D, TILE_N). +k_t = ct.load(K, index=(..., 0, j), shape=(1,1,TILE_D,TILE_N), + order=(0,1,3,2)).reshape((TILE_D, TILE_N)) +``` + +#### Latency Hints + +Prefetch data to overlap memory loads with computation. See the [Performance Tuning docs](https://docs.nvidia.com/cuda/cutile-python/performance.html) for the full list of load/store hints (e.g.
`allow_tma`, `latency`). + +```python +## latency=N tells the compiler to issue this load N loop iterations in +## advance of its use, so the memory transfer overlaps with the MMA work +## from earlier iterations. Larger latency = deeper software pipeline but +## more register pressure. +k_t = ct.load(K, ..., latency=2) # Prefetch K 2 iterations ahead +v_tile = ct.load(V, ..., latency=4) # Prefetch V 4 iterations ahead (used later in the loop) +``` + +#### Occupancy + +Allow multiple thread blocks per SM to hide memory latency. See the [Execution Model docs](https://docs.nvidia.com/cuda/cutile-python/execution.html) for details on how `occupancy` interacts with registers and shared memory. + +```python +## occupancy=N is a hint to the compiler to target N concurrent CTAs per SM. +## Higher occupancy -> more warps available to hide memory latency, +## but constrains the per-CTA register/SMEM budget. +@ct.kernel(occupancy=2) # 2 thread blocks (CTAs) co-resident per SM +def fmha_optimized(...): +``` + +#### Approximate Division + +Use fast approximate division for final normalization. + +```python +from cuda.tile import RoundingMode as RMd +## RMd.APPROX -> hardware approximate reciprocal/divide (MUFU), much faster +## than IEEE-compliant division. Safe here because it's the final softmax +## normalization step where a small ULP error is acceptable. +## flush_to_zero=True flushes denormals to 0 to avoid the slow path. +acc = ct.truediv(acc, l_i, flush_to_zero=True, rounding_mode=RMd.APPROX) +``` + +### Platform Configuration + +The same kernel code works on all platforms; only configuration parameters change. Use [`ct.ByTarget`](https://docs.nvidia.com/cuda/cutile-python/performance.html) to select values per architecture, or [`ct.autotune`](https://docs.nvidia.com/cuda/cutile-python/performance.html) to search candidate values automatically. 
+ +| Platform | TILE_M | TILE_N | Occupancy | Rationale | +|---|---|---|---|---| +| DGX Spark (sm_121) | 64 | 64 | 2 | Smaller tiles, higher occupancy for 48 SMs | +| B300 (sm_103) | 256 | 128 | 1 | Large tiles maximize HBM3e throughput | +| B300 alternate | 128 | 128 | 2 | Higher occupancy, balanced parallelism | + +```python +import cuda.tile as ct + +@ct.kernel( + # TILE_M / TILE_N: rows/cols of the Q and K/V tiles processed per CTA (passed as kernel arguments at launch). + # Larger tiles -> more arithmetic intensity; smaller tiles -> higher occupancy. + # occupancy: target concurrent CTAs per SM (latency hiding vs. register pressure). + occupancy=ct.ByTarget({ + "sm_121": 2, # DGX Spark (48 SMs): 2 CTAs/SM for latency hiding + "sm_103": 1, # B300: larger tiles already saturate the SM + "default": 1, # Conservative fallback for other architectures + }), + opt_level=3 # Maximum compiler optimization level +) +def fmha_kernel(...): + ... +``` + +### Performance Results + +> **Note:** PyTorch SDPA is used for correctness verification only, not performance comparison.
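
The TFLOPS figures in this section can be related to the measured latency with a short calculation. The sketch below follows the FLOP count used by `compute_flops` in the companion tutorial script (two matmuls, halved for causal masking) and assumes that script's benchmark configuration (batch 4, 32 heads, head dim 128, causal); the function names `attention_flops` and `tflops` are chosen here for illustration.

```python
def attention_flops(batch, heads, seq_len, head_dim, causal=True):
    """FLOPs for one attention forward pass: Q @ K^T plus P @ V."""
    flops_per_matmul = 2.0 * batch * heads * seq_len * seq_len * head_dim
    total = 2 * flops_per_matmul
    if causal:
        total *= 0.5  # causal masking skips roughly half of the score matrix
    return total


def tflops(latency_ms, batch=4, heads=32, seq_len=2048, head_dim=128, causal=True):
    """Convert a measured latency in milliseconds into achieved TFLOPS."""
    flops = attention_flops(batch, heads, seq_len, head_dim, causal)
    return flops / (latency_ms * 1e-3) / 1e12


# Basic cuTile kernel on DGX Spark at seq_len=2048: 2.19 ms
print(f"{tflops(2.19):.1f} TFLOPS")
```

Under these assumptions, 2.19 ms works out to about 62.8 TFLOPS. Other rows may differ in the last digit because the published latencies are rounded.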
+ +#### DGX Spark (sm_121) - Seq 2048 + +| Step | Optimization | Latency (ms) | TFLOPS | +|---|---|---|---| +| 1 | Basic cuTile | 2.19 | 62.8 | +| 2 | + exp2 | 2.07 | 66.5 | +| 3 | + Load Order | 2.07 | 66.3 | +| 4 | + Latency Hints | 2.07 | 66.5 | +| 5 | + Occupancy=2 | 1.73 | 79.5 | +| 6 | + Approx Div (Final) | 1.69 | 81.1 | + +#### B300 (sm_103) - Various Seq Lengths + +| Seq Len | Latency (ms) | TFLOPS | vs Spark | +|---|---|---|---| +| 1024 | 0.074 | 465 | 5.7x | +| 2048 | 0.178 | 770 | 9.5x | +| 4096 | 0.550 | 999 | 15.1x | +| 8192 | 1.897 | 1159 | 14.6x | +| 16384 | 7.014 | 1254 | 14.2x | + +### Common Issues + +| Issue | Solution | +|---|---| +| Shape mismatch in ct.mma | Ensure A is (M,K), B is (K,N), C is (M,N) | +| dtype errors | Use `.astype()` before mma; accumulator should be float32 | +| Incorrect results with causal | Check mask_start calculation and `offs_m >= offs_n` logic | +| Low performance | Try different TILE_M/N, check occupancy, verify latency hints | + +### Companion Scripts + +The following scripts are included in this playbook and can be run on DGX Spark or B300: + +- **[`assets/fmha_optimization_tutorial.py`](assets/fmha_optimization_tutorial.py)** - Step-by-step optimization tutorial. Builds the FMHA kernel from basic to fully optimized, matching the progression in this guide. +- **[`assets/fmha_scaling_analysis.py`](assets/fmha_scaling_analysis.py)** - Scaling analysis across sequence lengths. Benchmarks each optimization level and generates performance data.
+ +```bash +## Run the optimization tutorial (DGX Spark) +python assets/fmha_optimization_tutorial.py --correctness-check + +## Run the scaling analysis +python assets/fmha_scaling_analysis.py --iterations 100 +``` + +### References + +- [cuTile Python Documentation](https://docs.nvidia.com/cuda/cutile-python/) +- [Tile IR Specification](https://docs.nvidia.com/cuda/tile-ir/) +- [TileGym (pre-optimized kernels)](https://github.com/NVIDIA/TileGym) +- [NVIDIA Blog: Tuning Flash Attention for Peak Performance in CUDA Tile](https://developer.nvidia.com/blog/tuning-flash-attention-for-peak-performance-in-nvidia-cuda-tile/) +- [Flash Attention Paper](https://arxiv.org/abs/2205.14135) + +## Platform Comparison + +## DGX Spark vs B300 Performance Comparison + +This page summarizes performance scaling between DGX Spark (GB10) and B300 for both kernel benchmarks and end-to-end LLM inference. + +## Kernel Benchmark Scaling + +Use the ratios below as a reference for how kernel performance scales from DGX Spark (GB10) to B300. 
+ +| Kernel | Metric | B300 / GB10 | +|--------|--------|-------------| +| FMHA (causal, 8192) | TFLOPS | 13.7x | +| FMHA (non-causal, 8192) | TFLOPS | 15.1x | +| MatMul (8192) | TFLOPS | 18.9x | +| BMM (batch8, 4096) | TFLOPS | 19.4x | +| Group GEMM (4096) | TFLOPS | 23.9x | +| RMSNorm (4096) | GB/s | 33.1x | +| RoPE (16384) | GB/s | 22.8x | + +**Key Observations:** +- Compute-heavy kernels typically scale 14-24x from GB10 to B300 +- Memory-bound kernels can scale 20-33x due to HBM bandwidth advantage + +## Qwen2-7B Performance + +### End-to-End Throughput + +| Configuration | DGX Spark | B300 | Platform Speedup | +|---------------|-----------|------|------------------| +| **cuTile** | 18.52 tok/s | 257.33 tok/s | **13.9x** | + +### CUDA Kernel Time + +| Configuration | DGX Spark | B300 | Platform Speedup | +|---------------|-----------|------|------------------| +| **cuTile** | 43,080 ms | 2,954 ms | **14.6x** | + +### cuTile Kernel Breakdown + +**DGX Spark (GB10):** + +| Kernel | CUDA Time (ms) | Calls | +|--------|----------------|-------| +| `fmha_kernel` | 4,185.9 | 28 | +| `swiglu_forward_kernel` | 2,459.8 | 1,400 | +| `attention_decode_kernel_grouped` | 2,271.8 | 1,372 | +| `rms_norm_kernel_static_persistent` | 634.7 | 57 | +| `rope_kernel` | 355.6 | 1,400 | + +**B300:** + +| Kernel | CUDA Time (ms) | Speedup vs Spark | +|--------|----------------|------------------| +| `fmha_kernel` | 337.9 | 12.4x | +| `swiglu_forward_kernel` | 226.3 | 10.9x | +| `attention_decode_kernel_grouped` | 111.0 | 20.5x | +| `rms_norm_kernel_static_persistent` | 29.7 | 21.4x | +| `rope_kernel` | 16.7 | 21.3x | + +**Same code, different architectures** - cuTile JIT compiles for sm_121 (Spark) and sm_103 (B300) + +## Platform Specifications + +| Specification | DGX Spark (GB10) | B300 | +|---------------|------------------|------| +| Compute Capability | sm_121 (12.1) | sm_103 (10.3) | +| SMs | 48 | 132 | +| Memory | 128 GB LPDDR5x | 192 GB HBM3e | +| Memory Bandwidth | 273 GB/s | 
8 TB/s | + +## Troubleshooting + +| Symptom | Cause | Fix | +|---------|-------|-----| +| `docker: permission denied` | User not in docker group | `sudo usermod -aG docker $USER && newgrp docker` | +| `401 Client Error: Unauthorized` | Missing HuggingFace token | `export HF_TOKEN=` | +| `ModuleNotFoundError: tilegym` | TileGym not installed | `cd TileGym && pip install .` | +| `RuntimeError: CUDA out of memory` | Model too large | Reduce batch size or use smaller model | +| `Killed` during model load | Out of system memory | Clear cache: `sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'` | +| Slow first run | JIT compilation | Normal - cuTile compiles kernels on first run | +| `FileNotFoundError: input_prompt_small.txt` | Missing input file | Run from `modeling/transformers` directory | +| `torch.cuda.OutOfMemoryError` | Insufficient GPU memory | Reduce `--batch_size` parameter | +| `ImportError: cuda.tile` | Missing Tile IR | Install: `apt-get install cuda-tile-ir-13-1` | +| Benchmark hangs | GPU busy or locked | Check `nvidia-smi` for other processes | + +> [!NOTE] +> DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. +> With many applications still updating to take advantage of UMA, you may encounter memory issues even when within +> the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with: + +```bash +sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches' +``` + +> [!TIP] +> The first run of cuTile kernels includes JIT compilation overhead. Subsequent runs are faster because compiled kernels are cached. + +For the latest known issues, please review the [DGX Spark User Guide](https://docs.nvidia.com/dgx/dgx-spark/known-issues.html).
diff --git a/nvidia/cutile-kernels/assets/fmha_optimization_tutorial.py b/nvidia/cutile-kernels/assets/fmha_optimization_tutorial.py new file mode 100644 index 0000000..09f0f2c --- /dev/null +++ b/nvidia/cutile-kernels/assets/fmha_optimization_tutorial.py @@ -0,0 +1,959 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +#!/usr/bin/env python3 +""" +FMHA Optimization Tutorial: From Naive to Optimized cuTile Implementation + +This script demonstrates step-by-step optimization of Flash Multi-Head Attention +using NVIDIA cuTile, starting from a basic implementation and progressively +adding optimizations until reaching TileGym-level performance. + +Target Platform: DGX Spark (sm121) with pre-determined optimal tile sizes. +Note: TileGym supports autotuning, but we use hardcoded values for this tutorial. 
+ +Configuration (matches TileGym bench_fused_attention.py): +- Batch: 4, Heads: 32, Head Dim: 128 +- Sequence Lengths: 1024, 2048, 4096, 8192, 16384 +- Benchmark: triton.testing.do_bench_cudagraph + +Usage: + python fmha_optimization_tutorial.py [--iterations N] [--correctness-check] +""" + +import argparse +import json +import math +import time +from dataclasses import dataclass, asdict +from typing import List, Optional +import sys + +import torch + +LOG_SEPARATOR = "=" * 80 +LOG_SUBSEPARATOR = "-" * 60 + +@dataclass +class BenchmarkResult: + step: int + name: str + description: str + latency_ms: float + tflops: float + speedup_vs_baseline: float + speedup_vs_previous: float + correct: bool + key_changes: List[str] + +class Logger: + def __init__(self): + self.results: List[BenchmarkResult] = [] + self.logs: List[str] = [] + + def log(self, msg: str): + print(msg) + self.logs.append(msg) + + def section(self, title: str): + self.log(f"\n{LOG_SEPARATOR}") + self.log(f" {title}") + self.log(LOG_SEPARATOR) + + def subsection(self, title: str): + self.log(f"\n{LOG_SUBSEPARATOR}") + self.log(f" {title}") + self.log(LOG_SUBSEPARATOR) + + def add_result(self, result: BenchmarkResult): + self.results.append(result) + + def export_json(self, filepath: str): + data = { + "results": [asdict(r) for r in self.results], + "logs": self.logs + } + with open(filepath, 'w') as f: + json.dump(data, f, indent=2) + + def export_markdown(self, filepath: str): + with open(filepath, 'w') as f: + f.write("# FMHA Optimization Tutorial Results\n\n") + f.write("## Summary Table\n\n") + f.write("| Step | Name | Latency (ms) | TFLOPS | vs Baseline | vs Previous | Correct |\n") + f.write("|------|------|--------------|--------|-------------|-------------|--------|\n") + for r in self.results: + f.write(f"| {r.step} | {r.name} | {r.latency_ms:.3f} | {r.tflops:.2f} | {r.speedup_vs_baseline:.2f}x | {r.speedup_vs_previous:.2f}x | {'Yes' if r.correct else 'No'} |\n") + f.write("\n## Detailed 
Steps\n\n") + for r in self.results: + f.write(f"### Step {r.step}: {r.name}\n\n") + f.write(f"**Description**: {r.description}\n\n") + f.write("**Key Changes**:\n") + for change in r.key_changes: + f.write(f"- {change}\n") + f.write(f"\n**Performance**: {r.latency_ms:.3f}ms, {r.tflops:.2f} TFLOPS, {r.speedup_vs_baseline:.2f}x vs baseline\n\n") + +logger = Logger() + +BATCH = 4 +N_HEADS = 32 +HEAD_DIM = 128 +INV_LOG_2 = 1.0 / math.log(2) + +TILE_M = 64 +TILE_N = 64 +OCCUPANCY = 2 +NUM_CTAS = 1 + +def get_device(): + if torch.cuda.is_available(): + return torch.device("cuda") + raise RuntimeError("CUDA not available") + +DEVICE = None + +def compute_flops(batch, heads, seq_len, head_dim, causal=True): + flops_per_matmul = 2.0 * batch * heads * seq_len * seq_len * head_dim + total_flops = 2 * flops_per_matmul + if causal: + total_flops *= 0.5 + return total_flops + +def benchmark_fn(fn, warmup=10, iterations=100): + """Benchmark using triton's do_bench_cudagraph for accurate timing (matches TileGym)""" + try: + import triton + ms = triton.testing.do_bench_cudagraph(fn) + return ms + except (ImportError, Exception): + for _ in range(warmup): + fn() + torch.cuda.synchronize() + + start = time.perf_counter() + for _ in range(iterations): + fn() + torch.cuda.synchronize() + end = time.perf_counter() + + return (end - start) / iterations * 1000 + +def verify_correctness(output, reference, atol=1e-2, rtol=1e-2): + try: + torch.testing.assert_close(output, reference, atol=atol, rtol=rtol) + return True + except AssertionError: + max_diff = (output - reference).abs().max().item() + logger.log(f" [WARN] Max difference: {max_diff:.6f}") + return max_diff < 0.1 + +def reference_fmha(q, k, v, sm_scale, is_causal=True): + return torch.nn.functional.scaled_dot_product_attention( + q, k, v, attn_mask=None, dropout_p=0.0, is_causal=is_causal, scale=sm_scale + ) + +def step0_pytorch_baseline(q, k, v, sm_scale, is_causal=True): + return reference_fmha(q, k, v, sm_scale, is_causal) + 
+try: + import cuda.tile as ct + from cuda.tile import RoundingMode as RMd + CUTILE_AVAILABLE = True +except ImportError: + CUTILE_AVAILABLE = False + logger.log("[WARN] cuTile not available. Only PyTorch baseline will run.") + +if CUTILE_AVAILABLE: + ConstInt = ct.Constant[int] + ConstBool = ct.Constant[bool] + + @ct.kernel() + def fmha_step2_mma( + Q, K, V, Out, + qk_scale: float, + TILE_D: ConstInt, + H: ConstInt, + TILE_M: ConstInt, + TILE_N: ConstInt, + CAUSAL: ConstBool, + ): + """ + Step 2: Basic cuTile FMHA with MMA (Tensor Cores) + - Uses ct.mma() for matrix multiply + - Standard exp() for softmax + - Online softmax algorithm + """ + bid_x = ct.bid(0) + bid_y = ct.bid(1) + batch_idx = bid_y // H + head_idx = bid_y % H + + offs_m = bid_x * TILE_M + ct.arange(TILE_M, dtype=ct.int32) + offs_m = offs_m[:, None] + + offs_n_tile = ct.arange(TILE_N, dtype=ct.int32) + offs_n_tile = offs_n_tile[None, :] + + m_i = ct.full((TILE_M, 1), -math.inf, dtype=ct.float32) + l_i = ct.full((TILE_M, 1), 0.0, dtype=ct.float32) + acc = ct.full((TILE_M, TILE_D), 0.0, dtype=ct.float32) + + q = ct.load(Q, index=(batch_idx, head_idx, bid_x, 0), shape=(1, 1, TILE_M, TILE_D)) + q = q.reshape((TILE_M, TILE_D)) + + k_seqlen = K.shape[2] + if CAUSAL: + m_end = (bid_x + 1) * TILE_M + Tc = ct.cdiv(min(m_end, k_seqlen), TILE_N) + mask_start = (bid_x * TILE_M) // TILE_N + else: + Tc = ct.cdiv(k_seqlen, TILE_N) + mask_start = k_seqlen // TILE_N + + for j in range(0, Tc): + k_tile = ct.load(K, index=(batch_idx, head_idx, j, 0), shape=(1, 1, TILE_N, TILE_D)) + k_tile = k_tile.reshape((TILE_N, TILE_D)) + k_t = ct.permute(k_tile, (1, 0)) + + qk = ct.full((TILE_M, TILE_N), 0.0, dtype=ct.float32) + qk = ct.mma(q, k_t, qk) + qk = qk * qk_scale + + if CAUSAL and j >= mask_start: + offs_n = j * TILE_N + offs_n_tile + mask = offs_m >= offs_n + qk = ct.where(mask, qk, ct.full((TILE_M, TILE_N), -math.inf, dtype=ct.float32)) + + m_ij = ct.max(qk, axis=-1, keepdims=True) + m_ij = ct.maximum(m_i, m_ij) + qk 
= qk - m_ij + + p = ct.exp(qk) + l_ij = ct.sum(p, axis=-1, keepdims=True) + alpha = ct.exp(m_i - m_ij) + l_i = l_i * alpha + l_ij + acc = acc * alpha + + v_tile = ct.load(V, index=(batch_idx, head_idx, j, 0), shape=(1, 1, TILE_N, TILE_D)) + v_tile = v_tile.reshape((TILE_N, TILE_D)) + p_cast = p.astype(Q.dtype) + acc = ct.mma(p_cast, v_tile, acc) + + m_i = m_ij + + acc = acc / l_i + acc = acc.reshape((1, 1, TILE_M, TILE_D)).astype(Out.dtype) + ct.store(Out, index=(batch_idx, head_idx, bid_x, 0), tile=acc) + + def run_step2(q, k, v, sm_scale, is_causal=True): + batch_size, num_heads, seq_len, head_dim = q.shape + o = torch.empty_like(q) + grid = (math.ceil(seq_len / TILE_M), batch_size * num_heads, 1) + ct.launch( + torch.cuda.current_stream(), + grid, + fmha_step2_mma, + (q, k, v, o, sm_scale, head_dim, num_heads, TILE_M, TILE_N, is_causal) + ) + return o + + @ct.kernel() + def fmha_step3_exp2( + Q, K, V, Out, + qk_scale: float, + TILE_D: ConstInt, + H: ConstInt, + TILE_M: ConstInt, + TILE_N: ConstInt, + CAUSAL: ConstBool, + ): + """ + Step 3: Use exp2 with flush_to_zero for faster math + - exp2(x) = 2^x is faster than exp(x) = e^x on GPU + - Requires scaling adjustment: multiply by 1/log(2) + - flush_to_zero handles denormals efficiently + """ + bid_x = ct.bid(0) + bid_y = ct.bid(1) + batch_idx = bid_y // H + head_idx = bid_y % H + + qk_scale_log2 = qk_scale * INV_LOG_2 + + offs_m = bid_x * TILE_M + ct.arange(TILE_M, dtype=ct.int32) + offs_m = offs_m[:, None] + + offs_n_tile = ct.arange(TILE_N, dtype=ct.int32) + offs_n_tile = offs_n_tile[None, :] + + m_i = ct.full((TILE_M, 1), -math.inf, dtype=ct.float32) + l_i = ct.full((TILE_M, 1), 0.0, dtype=ct.float32) + acc = ct.full((TILE_M, TILE_D), 0.0, dtype=ct.float32) + + q = ct.load(Q, index=(batch_idx, head_idx, bid_x, 0), shape=(1, 1, TILE_M, TILE_D)) + q = q.reshape((TILE_M, TILE_D)) + + k_seqlen = K.shape[2] + if CAUSAL: + m_end = (bid_x + 1) * TILE_M + Tc = ct.cdiv(min(m_end, k_seqlen), TILE_N) + mask_start = 
(bid_x * TILE_M) // TILE_N + else: + Tc = ct.cdiv(k_seqlen, TILE_N) + mask_start = k_seqlen // TILE_N + + for j in range(0, Tc): + k_tile = ct.load(K, index=(batch_idx, head_idx, j, 0), shape=(1, 1, TILE_N, TILE_D)) + k_tile = k_tile.reshape((TILE_N, TILE_D)) + k_t = ct.permute(k_tile, (1, 0)) + + qk = ct.full((TILE_M, TILE_N), 0.0, dtype=ct.float32) + qk = ct.mma(q, k_t, qk) + + if CAUSAL and j >= mask_start: + offs_n = j * TILE_N + offs_n_tile + mask = offs_m >= offs_n + qk = ct.where(mask, qk, ct.full((TILE_M, TILE_N), -math.inf, dtype=ct.float32)) + + m_ij = ct.max(qk, axis=-1, keepdims=True) * qk_scale_log2 + m_ij = ct.maximum(m_i, m_ij) + qk = qk * qk_scale_log2 - m_ij + + p = ct.exp2(qk, flush_to_zero=True) + l_ij = ct.sum(p, axis=-1, keepdims=True) + alpha = ct.exp2(m_i - m_ij, flush_to_zero=True) + l_i = l_i * alpha + l_ij + acc = acc * alpha + + v_tile = ct.load(V, index=(batch_idx, head_idx, j, 0), shape=(1, 1, TILE_N, TILE_D)) + v_tile = v_tile.reshape((TILE_N, TILE_D)) + p_cast = p.astype(Q.dtype) + acc = ct.mma(p_cast, v_tile, acc) + + m_i = m_ij + + acc = acc / l_i + acc = acc.reshape((1, 1, TILE_M, TILE_D)).astype(Out.dtype) + ct.store(Out, index=(batch_idx, head_idx, bid_x, 0), tile=acc) + + def run_step3(q, k, v, sm_scale, is_causal=True): + batch_size, num_heads, seq_len, head_dim = q.shape + o = torch.empty_like(q) + grid = (math.ceil(seq_len / TILE_M), batch_size * num_heads, 1) + ct.launch( + torch.cuda.current_stream(), + grid, + fmha_step3_exp2, + (q, k, v, o, sm_scale, head_dim, num_heads, TILE_M, TILE_N, is_causal) + ) + return o + + @ct.kernel() + def fmha_step4_load_order( + Q, K, V, Out, + qk_scale: float, + TILE_D: ConstInt, + H: ConstInt, + TILE_M: ConstInt, + TILE_N: ConstInt, + CAUSAL: ConstBool, + ): + """ + Step 4: Optimize K load with order parameter + - Use order=(0,1,3,2) to load K already transposed + - Avoids explicit ct.permute() operation + - Reduces memory traffic + """ + bid_x = ct.bid(0) + bid_y = ct.bid(1) + batch_idx = 
bid_y // H + head_idx = bid_y % H + + qk_scale_log2 = qk_scale * INV_LOG_2 + + offs_m = bid_x * TILE_M + ct.arange(TILE_M, dtype=ct.int32) + offs_m = offs_m[:, None] + + offs_n_tile = ct.arange(TILE_N, dtype=ct.int32) + offs_n_tile = offs_n_tile[None, :] + + m_i = ct.full((TILE_M, 1), -math.inf, dtype=ct.float32) + l_i = ct.full((TILE_M, 1), 0.0, dtype=ct.float32) + acc = ct.full((TILE_M, TILE_D), 0.0, dtype=ct.float32) + + q = ct.load(Q, index=(batch_idx, head_idx, bid_x, 0), shape=(1, 1, TILE_M, TILE_D)) + q = q.reshape((TILE_M, TILE_D)) + + k_seqlen = K.shape[2] + if CAUSAL: + m_end = (bid_x + 1) * TILE_M + Tc = ct.cdiv(min(m_end, k_seqlen), TILE_N) + mask_start = (bid_x * TILE_M) // TILE_N + else: + Tc = ct.cdiv(k_seqlen, TILE_N) + mask_start = k_seqlen // TILE_N + + for j in range(0, Tc): + k_t = ct.load( + K, + index=(batch_idx, head_idx, 0, j), + shape=(1, 1, TILE_D, TILE_N), + order=(0, 1, 3, 2) + ) + k_t = k_t.reshape((TILE_D, TILE_N)) + + qk = ct.full((TILE_M, TILE_N), 0.0, dtype=ct.float32) + qk = ct.mma(q, k_t, qk) + + if CAUSAL and j >= mask_start: + offs_n = j * TILE_N + offs_n_tile + mask = offs_m >= offs_n + qk = ct.where(mask, qk, ct.full((TILE_M, TILE_N), -math.inf, dtype=ct.float32)) + + m_ij = ct.max(qk, axis=-1, keepdims=True) * qk_scale_log2 + m_ij = ct.maximum(m_i, m_ij) + qk = qk * qk_scale_log2 - m_ij + + p = ct.exp2(qk, flush_to_zero=True) + l_ij = ct.sum(p, axis=-1, keepdims=True) + alpha = ct.exp2(m_i - m_ij, flush_to_zero=True) + l_i = l_i * alpha + l_ij + acc = acc * alpha + + v_tile = ct.load(V, index=(batch_idx, head_idx, j, 0), shape=(1, 1, TILE_N, TILE_D)) + v_tile = v_tile.reshape((TILE_N, TILE_D)) + p_cast = p.astype(Q.dtype) + acc = ct.mma(p_cast, v_tile, acc) + + m_i = m_ij + + acc = acc / l_i + acc = acc.reshape((1, 1, TILE_M, TILE_D)).astype(Out.dtype) + ct.store(Out, index=(batch_idx, head_idx, bid_x, 0), tile=acc) + + def run_step4(q, k, v, sm_scale, is_causal=True): + batch_size, num_heads, seq_len, head_dim = q.shape + o 
= torch.empty_like(q) + grid = (math.ceil(seq_len / TILE_M), batch_size * num_heads, 1) + ct.launch( + torch.cuda.current_stream(), + grid, + fmha_step4_load_order, + (q, k, v, o, sm_scale, head_dim, num_heads, TILE_M, TILE_N, is_causal) + ) + return o + + @ct.kernel() + def fmha_step5_latency( + Q, K, V, Out, + qk_scale: float, + TILE_D: ConstInt, + H: ConstInt, + TILE_M: ConstInt, + TILE_N: ConstInt, + CAUSAL: ConstBool, + ): + """ + Step 5: Add latency hints for better pipelining + - latency=2 for K load (prefetch) + - latency=4 for V load (more prefetch distance) + - Helps overlap memory loads with computation + """ + bid_x = ct.bid(0) + bid_y = ct.bid(1) + batch_idx = bid_y // H + head_idx = bid_y % H + + qk_scale_log2 = qk_scale * INV_LOG_2 + + offs_m = bid_x * TILE_M + ct.arange(TILE_M, dtype=ct.int32) + offs_m = offs_m[:, None] + + offs_n_tile = ct.arange(TILE_N, dtype=ct.int32) + offs_n_tile = offs_n_tile[None, :] + + m_i = ct.full((TILE_M, 1), -math.inf, dtype=ct.float32) + l_i = ct.full((TILE_M, 1), 0.0, dtype=ct.float32) + acc = ct.full((TILE_M, TILE_D), 0.0, dtype=ct.float32) + + q = ct.load(Q, index=(batch_idx, head_idx, bid_x, 0), shape=(1, 1, TILE_M, TILE_D)) + q = q.reshape((TILE_M, TILE_D)) + + k_seqlen = K.shape[2] + if CAUSAL: + m_end = (bid_x + 1) * TILE_M + Tc = ct.cdiv(min(m_end, k_seqlen), TILE_N) + mask_start = (bid_x * TILE_M) // TILE_N + else: + Tc = ct.cdiv(k_seqlen, TILE_N) + mask_start = k_seqlen // TILE_N + + for j in range(0, Tc): + k_t = ct.load( + K, + index=(batch_idx, head_idx, 0, j), + shape=(1, 1, TILE_D, TILE_N), + order=(0, 1, 3, 2), + latency=2 + ) + k_t = k_t.reshape((TILE_D, TILE_N)) + + qk = ct.full((TILE_M, TILE_N), 0.0, dtype=ct.float32) + qk = ct.mma(q, k_t, qk) + + if CAUSAL and j >= mask_start: + offs_n = j * TILE_N + offs_n_tile + mask = offs_m >= offs_n + qk = ct.where(mask, qk, ct.full((TILE_M, TILE_N), -math.inf, dtype=ct.float32)) + + m_ij = ct.max(qk, axis=-1, keepdims=True) * qk_scale_log2 + m_ij = 
ct.maximum(m_i, m_ij) + qk = qk * qk_scale_log2 - m_ij + + p = ct.exp2(qk, flush_to_zero=True) + l_ij = ct.sum(p, axis=-1, keepdims=True) + alpha = ct.exp2(m_i - m_ij, flush_to_zero=True) + l_i = l_i * alpha + l_ij + acc = acc * alpha + + v_tile = ct.load( + V, + index=(batch_idx, head_idx, j, 0), + shape=(1, 1, TILE_N, TILE_D), + latency=4 + ) + v_tile = v_tile.reshape((TILE_N, TILE_D)) + p_cast = p.astype(Q.dtype) + acc = ct.mma(p_cast, v_tile, acc) + + m_i = m_ij + + acc = acc / l_i + acc = acc.reshape((1, 1, TILE_M, TILE_D)).astype(Out.dtype) + ct.store(Out, index=(batch_idx, head_idx, bid_x, 0), tile=acc) + + def run_step5(q, k, v, sm_scale, is_causal=True): + batch_size, num_heads, seq_len, head_dim = q.shape + o = torch.empty_like(q) + grid = (math.ceil(seq_len / TILE_M), batch_size * num_heads, 1) + ct.launch( + torch.cuda.current_stream(), + grid, + fmha_step5_latency, + (q, k, v, o, sm_scale, head_dim, num_heads, TILE_M, TILE_N, is_causal) + ) + return o + + @ct.kernel(occupancy=2) + def fmha_step6_occupancy( + Q, K, V, Out, + qk_scale: float, + TILE_D: ConstInt, + H: ConstInt, + TILE_M: ConstInt, + TILE_N: ConstInt, + CAUSAL: ConstBool, + ): + """ + Step 6: Add occupancy hint + - @ct.kernel(occupancy=2) improves SM utilization + - Allows multiple thread blocks per SM + - Better for hiding memory latency + """ + bid_x = ct.bid(0) + bid_y = ct.bid(1) + batch_idx = bid_y // H + head_idx = bid_y % H + + qk_scale_log2 = qk_scale * INV_LOG_2 + + offs_m = bid_x * TILE_M + ct.arange(TILE_M, dtype=ct.int32) + offs_m = offs_m[:, None] + + offs_n_tile = ct.arange(TILE_N, dtype=ct.int32) + offs_n_tile = offs_n_tile[None, :] + + m_i = ct.full((TILE_M, 1), -math.inf, dtype=ct.float32) + l_i = ct.full((TILE_M, 1), 0.0, dtype=ct.float32) + acc = ct.full((TILE_M, TILE_D), 0.0, dtype=ct.float32) + + q = ct.load(Q, index=(batch_idx, head_idx, bid_x, 0), shape=(1, 1, TILE_M, TILE_D)) + q = q.reshape((TILE_M, TILE_D)) + + k_seqlen = K.shape[2] + if CAUSAL: + m_end = (bid_x + 
1) * TILE_M + Tc = ct.cdiv(min(m_end, k_seqlen), TILE_N) + mask_start = (bid_x * TILE_M) // TILE_N + else: + Tc = ct.cdiv(k_seqlen, TILE_N) + mask_start = k_seqlen // TILE_N + + for j in range(0, Tc): + k_t = ct.load( + K, + index=(batch_idx, head_idx, 0, j), + shape=(1, 1, TILE_D, TILE_N), + order=(0, 1, 3, 2), + latency=2 + ) + k_t = k_t.reshape((TILE_D, TILE_N)) + + qk = ct.full((TILE_M, TILE_N), 0.0, dtype=ct.float32) + qk = ct.mma(q, k_t, qk) + + if CAUSAL and j >= mask_start: + offs_n = j * TILE_N + offs_n_tile + mask = offs_m >= offs_n + qk = ct.where(mask, qk, ct.full((TILE_M, TILE_N), -math.inf, dtype=ct.float32)) + + m_ij = ct.max(qk, axis=-1, keepdims=True) * qk_scale_log2 + m_ij = ct.maximum(m_i, m_ij) + qk = qk * qk_scale_log2 - m_ij + + p = ct.exp2(qk, flush_to_zero=True) + l_ij = ct.sum(p, axis=-1, keepdims=True) + alpha = ct.exp2(m_i - m_ij, flush_to_zero=True) + l_i = l_i * alpha + l_ij + acc = acc * alpha + + v_tile = ct.load( + V, + index=(batch_idx, head_idx, j, 0), + shape=(1, 1, TILE_N, TILE_D), + latency=4 + ) + v_tile = v_tile.reshape((TILE_N, TILE_D)) + p_cast = p.astype(Q.dtype) + acc = ct.mma(p_cast, v_tile, acc) + + m_i = m_ij + + acc = acc / l_i + acc = acc.reshape((1, 1, TILE_M, TILE_D)).astype(Out.dtype) + ct.store(Out, index=(batch_idx, head_idx, bid_x, 0), tile=acc) + + def run_step6(q, k, v, sm_scale, is_causal=True): + batch_size, num_heads, seq_len, head_dim = q.shape + o = torch.empty_like(q) + grid = (math.ceil(seq_len / TILE_M), batch_size * num_heads, 1) + ct.launch( + torch.cuda.current_stream(), + grid, + fmha_step6_occupancy, + (q, k, v, o, sm_scale, head_dim, num_heads, TILE_M, TILE_N, is_causal) + ) + return o + + @ct.kernel(occupancy=2) + def fmha_step7_approx_div( + Q, K, V, Out, + qk_scale: float, + TILE_D: ConstInt, + H: ConstInt, + TILE_M: ConstInt, + TILE_N: ConstInt, + CAUSAL: ConstBool, + ): + """ + Step 7: Use approximate division for final normalization + - ct.truediv with rounding_mode=APPROX is faster + - 
Acceptable accuracy loss for inference + - This matches TileGym's optimized implementation + """ + bid_x = ct.bid(0) + bid_y = ct.bid(1) + batch_idx = bid_y // H + head_idx = bid_y % H + + qk_scale_log2 = qk_scale * INV_LOG_2 + + offs_m = bid_x * TILE_M + ct.arange(TILE_M, dtype=ct.int32) + offs_m = offs_m[:, None] + + offs_n_tile = ct.arange(TILE_N, dtype=ct.int32) + offs_n_tile = offs_n_tile[None, :] + + m_i = ct.full((TILE_M, 1), -math.inf, dtype=ct.float32) + l_i = ct.full((TILE_M, 1), 0.0, dtype=ct.float32) + acc = ct.full((TILE_M, TILE_D), 0.0, dtype=ct.float32) + + q = ct.load(Q, index=(batch_idx, head_idx, bid_x, 0), shape=(1, 1, TILE_M, TILE_D)) + q = q.reshape((TILE_M, TILE_D)) + + k_seqlen = K.shape[2] + if CAUSAL: + m_end = (bid_x + 1) * TILE_M + Tc = ct.cdiv(min(m_end, k_seqlen), TILE_N) + mask_start = (bid_x * TILE_M) // TILE_N + else: + Tc = ct.cdiv(k_seqlen, TILE_N) + mask_start = k_seqlen // TILE_N + + for j in range(0, Tc): + k_t = ct.load( + K, + index=(batch_idx, head_idx, 0, j), + shape=(1, 1, TILE_D, TILE_N), + order=(0, 1, 3, 2), + latency=2 + ) + k_t = k_t.reshape((TILE_D, TILE_N)) + + qk = ct.full((TILE_M, TILE_N), 0.0, dtype=ct.float32) + qk = ct.mma(q, k_t, qk) + + if CAUSAL and j >= mask_start: + offs_n = j * TILE_N + offs_n_tile + mask = offs_m >= offs_n + qk = ct.where(mask, qk, ct.full((TILE_M, TILE_N), -math.inf, dtype=ct.float32)) + + m_ij = ct.max(qk, axis=-1, keepdims=True) * qk_scale_log2 + m_ij = ct.maximum(m_i, m_ij) + qk = qk * qk_scale_log2 - m_ij + + p = ct.exp2(qk, flush_to_zero=True) + l_ij = ct.sum(p, axis=-1, keepdims=True) + alpha = ct.exp2(m_i - m_ij, flush_to_zero=True) + l_i = l_i * alpha + l_ij + acc = acc * alpha + + v_tile = ct.load( + V, + index=(batch_idx, head_idx, j, 0), + shape=(1, 1, TILE_N, TILE_D), + latency=4 + ) + v_tile = v_tile.reshape((TILE_N, TILE_D)) + p_cast = p.astype(Q.dtype) + acc = ct.mma(p_cast, v_tile, acc) + + m_i = m_ij + + acc = ct.truediv(acc, l_i, flush_to_zero=True, 
rounding_mode=RMd.APPROX) + acc = acc.reshape((1, 1, TILE_M, TILE_D)).astype(Out.dtype) + ct.store(Out, index=(batch_idx, head_idx, bid_x, 0), tile=acc) + + def run_step7(q, k, v, sm_scale, is_causal=True): + batch_size, num_heads, seq_len, head_dim = q.shape + o = torch.empty_like(q) + grid = (math.ceil(seq_len / TILE_M), batch_size * num_heads, 1) + ct.launch( + torch.cuda.current_stream(), + grid, + fmha_step7_approx_div, + (q, k, v, o, sm_scale, head_dim, num_heads, TILE_M, TILE_N, is_causal) + ) + return o + + +def run_tilegym_fmha(q, k, v, sm_scale, is_causal=True): + """Run TileGym's optimized FMHA for comparison""" + try: + import tilegym + return tilegym.ops.fmha(q, k, v, scaling=sm_scale, is_causal=is_causal, backend="cutile") + except ImportError: + logger.log("[WARN] TileGym not available for comparison") + return None + + +def run_benchmark(seq_len, iterations=100, check_correct=True): + global DEVICE + DEVICE = get_device() + + logger.section(f"FMHA OPTIMIZATION TUTORIAL - SEQ_LEN={seq_len}") + logger.log("Configuration:") + logger.log(f" - Batch: {BATCH}") + logger.log(f" - Heads: {N_HEADS}") + logger.log(f" - Head Dim: {HEAD_DIM}") + logger.log(f" - Sequence Length: {seq_len}") + logger.log(f" - Tile M: {TILE_M}") + logger.log(f" - Tile N: {TILE_N}") + logger.log(" - Precision: float16") + logger.log(" - Causal: True") + logger.log(f" - Iterations: {iterations}") + logger.log(f" - Device: {DEVICE}") + + q = torch.randn(BATCH, N_HEADS, seq_len, HEAD_DIM, dtype=torch.float16, device=DEVICE) + k = torch.randn(BATCH, N_HEADS, seq_len, HEAD_DIM, dtype=torch.float16, device=DEVICE) + v = torch.randn(BATCH, N_HEADS, seq_len, HEAD_DIM, dtype=torch.float16, device=DEVICE) + sm_scale = 1.0 / math.sqrt(HEAD_DIM) + + flops = compute_flops(BATCH, N_HEADS, seq_len, HEAD_DIM, causal=True) + + ref_output = reference_fmha(q, k, v, sm_scale, is_causal=True) + + steps = [ + (0, "PyTorch Baseline", "torch.nn.functional.scaled_dot_product_attention", + lambda: 
step0_pytorch_baseline(q, k, v, sm_scale, is_causal=True), + ["PyTorch SDPA with cuDNN backend", "Highly optimized baseline"]), + ] + + if CUTILE_AVAILABLE: + steps.extend([ + (2, "Basic cuTile + MMA", "Tiled FMHA with ct.mma() for Tensor Cores", + lambda: run_step2(q, k, v, sm_scale, is_causal=True), + ["@ct.kernel decorator", "ct.mma() for QK and PV products", "Online softmax with exp()"]), + + (3, "+ exp2 + flush_to_zero", "Faster exponential math", + lambda: run_step3(q, k, v, sm_scale, is_causal=True), + ["ct.exp2() instead of ct.exp()", "flush_to_zero=True for denormals", "qk_scale *= 1/log(2)"]), + + (4, "+ Load Order Transpose", "Avoid explicit transpose", + lambda: run_step4(q, k, v, sm_scale, is_causal=True), + ["order=(0,1,3,2) for K load", "K loaded already transposed", "Removes ct.permute() call"]), + + (5, "+ Latency Hints", "Better memory pipelining", + lambda: run_step5(q, k, v, sm_scale, is_causal=True), + ["latency=2 for K load", "latency=4 for V load", "Overlaps loads with compute"]), + + (6, "+ Occupancy=2", "Better SM utilization", + lambda: run_step6(q, k, v, sm_scale, is_causal=True), + ["@ct.kernel(occupancy=2)", "Multiple blocks per SM", "Hides memory latency"]), + + (7, "+ Approx Division (Final)", "Fast final normalization", + lambda: run_step7(q, k, v, sm_scale, is_causal=True), + ["ct.truediv with APPROX mode", "Matches TileGym implementation", "Full optimization achieved"]), + ]) + + baseline_latency = None + prev_latency = None + + for step_idx, name, desc, fn, changes in steps: + logger.subsection(f"Step {step_idx}: {name}") + logger.log(f"Description: {desc}") + logger.log("Key Changes:") + for change in changes: + logger.log(f" - {change}") + + try: + output = fn() + latency_ms = benchmark_fn(fn, warmup=10, iterations=iterations) + tflops = flops * 1e-12 / (latency_ms * 1e-3) + + if baseline_latency is None: + baseline_latency = latency_ms + speedup_baseline = 1.0 + else: + speedup_baseline = baseline_latency / latency_ms + + if 
prev_latency is None: + speedup_prev = 1.0 + else: + speedup_prev = prev_latency / latency_ms + + if check_correct and output is not None: + correct = verify_correctness(output, ref_output) + else: + correct = True + + logger.log("\nResults:") + logger.log(f" Latency: {latency_ms:.3f} ms") + logger.log(f" TFLOPS: {tflops:.2f}") + logger.log(f" vs Baseline: {speedup_baseline:.2f}x") + logger.log(f" vs Previous: {speedup_prev:.2f}x") + logger.log(f" Correct: {'Yes' if correct else 'No'}") + + result = BenchmarkResult( + step=step_idx, + name=name, + description=desc, + latency_ms=latency_ms, + tflops=tflops, + speedup_vs_baseline=speedup_baseline, + speedup_vs_previous=speedup_prev, + correct=correct, + key_changes=changes + ) + logger.add_result(result) + + prev_latency = latency_ms + + except Exception as e: + logger.log(f"\n[ERROR] Step {step_idx} failed: {e}") + import traceback + logger.log(traceback.format_exc()) + + tilegym_output = run_tilegym_fmha(q, k, v, sm_scale, is_causal=True) + if tilegym_output is not None: + logger.subsection("TileGym Reference (for comparison)") + tilegym_fn = lambda: run_tilegym_fmha(q, k, v, sm_scale, is_causal=True) + tilegym_latency = benchmark_fn(tilegym_fn, warmup=10, iterations=iterations) + tilegym_tflops = flops * 1e-12 / (tilegym_latency * 1e-3) + tilegym_speedup = baseline_latency / tilegym_latency if baseline_latency else 1.0 + + logger.log("TileGym FMHA:") + logger.log(f" Latency: {tilegym_latency:.3f} ms") + logger.log(f" TFLOPS: {tilegym_tflops:.2f}") + logger.log(f" vs Baseline: {tilegym_speedup:.2f}x") + + result = BenchmarkResult( + step=99, + name="TileGym Reference", + description="TileGym's optimized FMHA implementation", + latency_ms=tilegym_latency, + tflops=tilegym_tflops, + speedup_vs_baseline=tilegym_speedup, + speedup_vs_previous=1.0, + correct=True, + key_changes=["Full TileGym implementation", "Pre-tuned for sm121", "Production ready"] + ) + logger.add_result(result) + + +def main(): + parser = 
argparse.ArgumentParser(description="FMHA Optimization Tutorial") + parser.add_argument("--iterations", type=int, default=100, help="Benchmark iterations") + parser.add_argument("--seq-len", type=int, default=2048, help="Sequence length (default matches TileGym)") + parser.add_argument("--correctness-check", action="store_true", help="Enable correctness checking") + parser.add_argument("--output-dir", type=str, default=".", help="Output directory for logs") + args = parser.parse_args() + + logger.section("FMHA OPTIMIZATION TUTORIAL") + logger.log("From Naive to Optimized cuTile Implementation") + logger.log("Target Platform: DGX Spark (sm121)") + logger.log(f"Tile Sizes: TILE_M={TILE_M}, TILE_N={TILE_N} (hardcoded from TileGym)") + logger.log("Note: TileGym supports autotuning, but we use pre-determined optimal values") + + run_benchmark( + seq_len=args.seq_len, + iterations=args.iterations, + check_correct=args.correctness_check + ) + + logger.section("FINAL SUMMARY") + logger.log("\n| Step | Name | Latency (ms) | TFLOPS | vs Baseline | Correct |") + logger.log("|------|------|--------------|--------|-------------|---------|") + for r in logger.results: + logger.log(f"| {r.step} | {r.name} | {r.latency_ms:.3f} | {r.tflops:.2f} | {r.speedup_vs_baseline:.2f}x | {'Yes' if r.correct else 'No'} |") + + json_path = f"{args.output_dir}/fmha_tutorial_results.json" + md_path = f"{args.output_dir}/fmha_tutorial_results.md" + log_path = f"{args.output_dir}/fmha_tutorial_log.txt" + + logger.export_json(json_path) + logger.export_markdown(md_path) + + with open(log_path, 'w') as f: + f.write('\n'.join(logger.logs)) + + logger.log("\nResults exported to:") + logger.log(f" - {json_path}") + logger.log(f" - {md_path}") + logger.log(f" - {log_path}") + + +if __name__ == "__main__": + main() diff --git a/nvidia/cutile-kernels/assets/fmha_scaling_analysis.py b/nvidia/cutile-kernels/assets/fmha_scaling_analysis.py new file mode 100644 index 0000000..754b9ca --- /dev/null +++ 
b/nvidia/cutile-kernels/assets/fmha_scaling_analysis.py @@ -0,0 +1,891 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +#!/usr/bin/env python3 +""" +FMHA Scaling Analysis: How Optimizations Impact Performance at Different Sizes + +This script demonstrates: +1. How FMHA performance scales with sequence length +2. Which optimizations provide the most benefit at larger sizes +3. 
Target-specific configurations for different GPU architectures + +Target Platforms (from TileGym): +- DGX Spark (sm120/sm121): TILE_M=64, TILE_N=64, num_ctas=1, occupancy=2 +- Blackwell B300 (sm100): TILE_M=256, TILE_N=128 or 128x128, num_ctas=1, occupancy=1-2 + +Usage: + python fmha_scaling_analysis.py [--iterations N] +""" + +import argparse +import json +import math +import time +from dataclasses import dataclass, asdict +from typing import List +from types import SimpleNamespace + +import torch + +LOG_SEPARATOR = "=" * 80 +LOG_SUBSEPARATOR = "-" * 60 + +@dataclass +class StepResult: + step: int + name: str + latency_ms: float + tflops: float + speedup_vs_baseline: float + +@dataclass +class SeqLenResult: + seq_len: int + steps: List[StepResult] + best_step: int + best_speedup: float + tilegym_latency_ms: float + tilegym_tflops: float + tilegym_speedup: float + +class Logger: + def __init__(self): + self.results: List[SeqLenResult] = [] + self.logs: List[str] = [] + + def log(self, msg: str): + print(msg) + self.logs.append(msg) + + def section(self, title: str): + self.log(f"\n{LOG_SEPARATOR}") + self.log(f" {title}") + self.log(LOG_SEPARATOR) + + def subsection(self, title: str): + self.log(f"\n{LOG_SUBSEPARATOR}") + self.log(f" {title}") + self.log(LOG_SUBSEPARATOR) + +logger = Logger() + +BATCH = 4 +N_HEADS = 32 +HEAD_DIM = 128 +INV_LOG_2 = 1.0 / math.log(2) + +SEQ_LENS = [1024, 2048, 4096, 8192, 16384] + + +def get_fmha_config(): + """ + Get target-specific FMHA configuration (from TileGym attention.py) + + Returns configs matching TileGym's _fmha_autotune_configs(): + - sm120/sm121 (DGX Spark): TILE_M=64, TILE_N=64, num_ctas=1, occupancy=2 + - sm100 (Blackwell B300): Two configs to try via autotuning + """ + gpu_capability = torch.cuda.get_device_capability() + + if gpu_capability in [(12, 0), (12, 1)]: + return [ + SimpleNamespace( + name="DGX Spark (sm121)", + TILE_M=64, + TILE_N=64, + num_ctas=1, + occupancy=2 + ) + ] + else: + return [ + 
SimpleNamespace( + name="Blackwell B300 (sm100) - Config 1", + TILE_M=256, + TILE_N=128, + num_ctas=1, + occupancy=1 + ), + SimpleNamespace( + name="Blackwell B300 (sm100) - Config 2", + TILE_M=128, + TILE_N=128, + num_ctas=1, + occupancy=2 + ), + ] + + +def get_device(): + if torch.cuda.is_available(): + return torch.device("cuda") + raise RuntimeError("CUDA not available") + +def compute_flops(batch, heads, seq_len, head_dim, causal=True): + flops_per_matmul = 2.0 * batch * heads * seq_len * seq_len * head_dim + total_flops = 2 * flops_per_matmul + if causal: + total_flops *= 0.5 + return total_flops + +def benchmark_fn(fn, warmup=10, iterations=100): + """Benchmark with triton's do_bench_cudagraph; warmup/iterations apply only to the manual fallback""" + try: + import triton + # Use triton's cudagraph benchmark - same as TileGym + ms = triton.testing.do_bench_cudagraph(fn) + return ms + except Exception: + # Triton missing or CUDA-graph benchmark failed: fall back to manual timing + for _ in range(warmup): + fn() + torch.cuda.synchronize() + + start = time.perf_counter() + for _ in range(iterations): + fn() + torch.cuda.synchronize() + end = time.perf_counter() + + return (end - start) / iterations * 1000 + +def reference_fmha(q, k, v, sm_scale, is_causal=True): + return torch.nn.functional.scaled_dot_product_attention( + q, k, v, attn_mask=None, dropout_p=0.0, is_causal=is_causal, scale=sm_scale + ) + +try: + import cuda.tile as ct + from cuda.tile import RoundingMode as RMd + CUTILE_AVAILABLE = True +except ImportError: + CUTILE_AVAILABLE = False + logger.log("[WARN] cuTile not available.") + +if CUTILE_AVAILABLE: + ConstInt = ct.Constant[int] + ConstBool = ct.Constant[bool] + + @ct.kernel() + def fmha_basic( + Q, K, V, Out, + qk_scale: float, + TILE_D: ConstInt, + H: ConstInt, + TILE_M: ConstInt, + TILE_N: ConstInt, + CAUSAL: ConstBool, + ): + """Step 1: Basic cuTile - no optimizations""" + bid_x = ct.bid(0) + bid_y = ct.bid(1) + batch_idx = bid_y // H + head_idx = bid_y % H + + offs_m = bid_x * TILE_M + ct.arange(TILE_M, 
dtype=ct.int32) + offs_m = offs_m[:, None] + offs_n_tile = ct.arange(TILE_N, dtype=ct.int32) + offs_n_tile = offs_n_tile[None, :] + + m_i = ct.full((TILE_M, 1), -math.inf, dtype=ct.float32) + l_i = ct.full((TILE_M, 1), 0.0, dtype=ct.float32) + acc = ct.full((TILE_M, TILE_D), 0.0, dtype=ct.float32) + + q = ct.load(Q, index=(batch_idx, head_idx, bid_x, 0), shape=(1, 1, TILE_M, TILE_D)) + q = q.reshape((TILE_M, TILE_D)) + + k_seqlen = K.shape[2] + if CAUSAL: + m_end = (bid_x + 1) * TILE_M + Tc = ct.cdiv(min(m_end, k_seqlen), TILE_N) + mask_start = (bid_x * TILE_M) // TILE_N + else: + Tc = ct.cdiv(k_seqlen, TILE_N) + mask_start = k_seqlen // TILE_N + + for j in range(0, Tc): + k_tile = ct.load(K, index=(batch_idx, head_idx, j, 0), shape=(1, 1, TILE_N, TILE_D)) + k_tile = k_tile.reshape((TILE_N, TILE_D)) + k_t = ct.permute(k_tile, (1, 0)) + + qk = ct.full((TILE_M, TILE_N), 0.0, dtype=ct.float32) + qk = ct.mma(q, k_t, qk) + qk = qk * qk_scale + + if CAUSAL and j >= mask_start: + offs_n = j * TILE_N + offs_n_tile + mask = offs_m >= offs_n + qk = ct.where(mask, qk, ct.full((TILE_M, TILE_N), -math.inf, dtype=ct.float32)) + + m_ij = ct.max(qk, axis=-1, keepdims=True) + m_ij = ct.maximum(m_i, m_ij) + qk = qk - m_ij + + p = ct.exp(qk) + l_ij = ct.sum(p, axis=-1, keepdims=True) + alpha = ct.exp(m_i - m_ij) + l_i = l_i * alpha + l_ij + acc = acc * alpha + + v_tile = ct.load(V, index=(batch_idx, head_idx, j, 0), shape=(1, 1, TILE_N, TILE_D)) + v_tile = v_tile.reshape((TILE_N, TILE_D)) + p_cast = p.astype(Q.dtype) + acc = ct.mma(p_cast, v_tile, acc) + m_i = m_ij + + acc = acc / l_i + acc = acc.reshape((1, 1, TILE_M, TILE_D)).astype(Out.dtype) + ct.store(Out, index=(batch_idx, head_idx, bid_x, 0), tile=acc) + + @ct.kernel() + def fmha_math_opt( + Q, K, V, Out, + qk_scale: float, + TILE_D: ConstInt, + H: ConstInt, + TILE_M: ConstInt, + TILE_N: ConstInt, + CAUSAL: ConstBool, + ): + """Step 2: Math optimizations - exp2 + flush_to_zero""" + bid_x = ct.bid(0) + bid_y = ct.bid(1) + 
batch_idx = bid_y // H + head_idx = bid_y % H + + qk_scale_log2 = qk_scale * INV_LOG_2 + + offs_m = bid_x * TILE_M + ct.arange(TILE_M, dtype=ct.int32) + offs_m = offs_m[:, None] + offs_n_tile = ct.arange(TILE_N, dtype=ct.int32) + offs_n_tile = offs_n_tile[None, :] + + m_i = ct.full((TILE_M, 1), -math.inf, dtype=ct.float32) + l_i = ct.full((TILE_M, 1), 0.0, dtype=ct.float32) + acc = ct.full((TILE_M, TILE_D), 0.0, dtype=ct.float32) + + q = ct.load(Q, index=(batch_idx, head_idx, bid_x, 0), shape=(1, 1, TILE_M, TILE_D)) + q = q.reshape((TILE_M, TILE_D)) + + k_seqlen = K.shape[2] + if CAUSAL: + m_end = (bid_x + 1) * TILE_M + Tc = ct.cdiv(min(m_end, k_seqlen), TILE_N) + mask_start = (bid_x * TILE_M) // TILE_N + else: + Tc = ct.cdiv(k_seqlen, TILE_N) + mask_start = k_seqlen // TILE_N + + for j in range(0, Tc): + k_tile = ct.load(K, index=(batch_idx, head_idx, j, 0), shape=(1, 1, TILE_N, TILE_D)) + k_tile = k_tile.reshape((TILE_N, TILE_D)) + k_t = ct.permute(k_tile, (1, 0)) + + qk = ct.full((TILE_M, TILE_N), 0.0, dtype=ct.float32) + qk = ct.mma(q, k_t, qk) + + if CAUSAL and j >= mask_start: + offs_n = j * TILE_N + offs_n_tile + mask = offs_m >= offs_n + qk = ct.where(mask, qk, ct.full((TILE_M, TILE_N), -math.inf, dtype=ct.float32)) + + m_ij = ct.max(qk, axis=-1, keepdims=True) * qk_scale_log2 + m_ij = ct.maximum(m_i, m_ij) + qk = qk * qk_scale_log2 - m_ij + + p = ct.exp2(qk, flush_to_zero=True) + l_ij = ct.sum(p, axis=-1, keepdims=True) + alpha = ct.exp2(m_i - m_ij, flush_to_zero=True) + l_i = l_i * alpha + l_ij + acc = acc * alpha + + v_tile = ct.load(V, index=(batch_idx, head_idx, j, 0), shape=(1, 1, TILE_N, TILE_D)) + v_tile = v_tile.reshape((TILE_N, TILE_D)) + p_cast = p.astype(Q.dtype) + acc = ct.mma(p_cast, v_tile, acc) + m_i = m_ij + + acc = acc / l_i + acc = acc.reshape((1, 1, TILE_M, TILE_D)).astype(Out.dtype) + ct.store(Out, index=(batch_idx, head_idx, bid_x, 0), tile=acc) + + @ct.kernel() + def fmha_memory_opt( + Q, K, V, Out, + qk_scale: float, + TILE_D: 
ConstInt, + H: ConstInt, + TILE_M: ConstInt, + TILE_N: ConstInt, + CAUSAL: ConstBool, + ): + """Step 3: Memory optimizations - load order + latency hints""" + bid_x = ct.bid(0) + bid_y = ct.bid(1) + batch_idx = bid_y // H + head_idx = bid_y % H + + qk_scale_log2 = qk_scale * INV_LOG_2 + + offs_m = bid_x * TILE_M + ct.arange(TILE_M, dtype=ct.int32) + offs_m = offs_m[:, None] + offs_n_tile = ct.arange(TILE_N, dtype=ct.int32) + offs_n_tile = offs_n_tile[None, :] + + m_i = ct.full((TILE_M, 1), -math.inf, dtype=ct.float32) + l_i = ct.full((TILE_M, 1), 0.0, dtype=ct.float32) + acc = ct.full((TILE_M, TILE_D), 0.0, dtype=ct.float32) + + q = ct.load(Q, index=(batch_idx, head_idx, bid_x, 0), shape=(1, 1, TILE_M, TILE_D)) + q = q.reshape((TILE_M, TILE_D)) + + k_seqlen = K.shape[2] + if CAUSAL: + m_end = (bid_x + 1) * TILE_M + Tc = ct.cdiv(min(m_end, k_seqlen), TILE_N) + mask_start = (bid_x * TILE_M) // TILE_N + else: + Tc = ct.cdiv(k_seqlen, TILE_N) + mask_start = k_seqlen // TILE_N + + for j in range(0, Tc): + k_t = ct.load( + K, + index=(batch_idx, head_idx, 0, j), + shape=(1, 1, TILE_D, TILE_N), + order=(0, 1, 3, 2), + latency=2 + ) + k_t = k_t.reshape((TILE_D, TILE_N)) + + qk = ct.full((TILE_M, TILE_N), 0.0, dtype=ct.float32) + qk = ct.mma(q, k_t, qk) + + if CAUSAL and j >= mask_start: + offs_n = j * TILE_N + offs_n_tile + mask = offs_m >= offs_n + qk = ct.where(mask, qk, ct.full((TILE_M, TILE_N), -math.inf, dtype=ct.float32)) + + m_ij = ct.max(qk, axis=-1, keepdims=True) * qk_scale_log2 + m_ij = ct.maximum(m_i, m_ij) + qk = qk * qk_scale_log2 - m_ij + + p = ct.exp2(qk, flush_to_zero=True) + l_ij = ct.sum(p, axis=-1, keepdims=True) + alpha = ct.exp2(m_i - m_ij, flush_to_zero=True) + l_i = l_i * alpha + l_ij + acc = acc * alpha + + v_tile = ct.load( + V, + index=(batch_idx, head_idx, j, 0), + shape=(1, 1, TILE_N, TILE_D), + latency=4 + ) + v_tile = v_tile.reshape((TILE_N, TILE_D)) + p_cast = p.astype(Q.dtype) + acc = ct.mma(p_cast, v_tile, acc) + m_i = m_ij + + acc = acc / 
l_i + acc = acc.reshape((1, 1, TILE_M, TILE_D)).astype(Out.dtype) + ct.store(Out, index=(batch_idx, head_idx, bid_x, 0), tile=acc) + + @ct.kernel(occupancy=2) + def fmha_full_opt_occ2( + Q, K, V, Out, + qk_scale: float, + TILE_D: ConstInt, + H: ConstInt, + TILE_M: ConstInt, + TILE_N: ConstInt, + CAUSAL: ConstBool, + ): + """Step 4a: Full optimization with occupancy=2 (for sm120/sm121)""" + bid_x = ct.bid(0) + bid_y = ct.bid(1) + batch_idx = bid_y // H + head_idx = bid_y % H + + qk_scale_log2 = qk_scale * INV_LOG_2 + + offs_m = bid_x * TILE_M + ct.arange(TILE_M, dtype=ct.int32) + offs_m = offs_m[:, None] + offs_n_tile = ct.arange(TILE_N, dtype=ct.int32) + offs_n_tile = offs_n_tile[None, :] + + m_i = ct.full((TILE_M, 1), -math.inf, dtype=ct.float32) + l_i = ct.full((TILE_M, 1), 0.0, dtype=ct.float32) + acc = ct.full((TILE_M, TILE_D), 0.0, dtype=ct.float32) + + q = ct.load(Q, index=(batch_idx, head_idx, bid_x, 0), shape=(1, 1, TILE_M, TILE_D)) + q = q.reshape((TILE_M, TILE_D)) + + k_seqlen = K.shape[2] + if CAUSAL: + m_end = (bid_x + 1) * TILE_M + Tc = ct.cdiv(min(m_end, k_seqlen), TILE_N) + mask_start = (bid_x * TILE_M) // TILE_N + else: + Tc = ct.cdiv(k_seqlen, TILE_N) + mask_start = k_seqlen // TILE_N + + for j in range(0, Tc): + k_t = ct.load( + K, + index=(batch_idx, head_idx, 0, j), + shape=(1, 1, TILE_D, TILE_N), + order=(0, 1, 3, 2), + latency=2 + ) + k_t = k_t.reshape((TILE_D, TILE_N)) + + qk = ct.full((TILE_M, TILE_N), 0.0, dtype=ct.float32) + qk = ct.mma(q, k_t, qk) + + if CAUSAL and j >= mask_start: + offs_n = j * TILE_N + offs_n_tile + mask = offs_m >= offs_n + qk = ct.where(mask, qk, ct.full((TILE_M, TILE_N), -math.inf, dtype=ct.float32)) + + m_ij = ct.max(qk, axis=-1, keepdims=True) * qk_scale_log2 + m_ij = ct.maximum(m_i, m_ij) + qk = qk * qk_scale_log2 - m_ij + + p = ct.exp2(qk, flush_to_zero=True) + l_ij = ct.sum(p, axis=-1, keepdims=True) + alpha = ct.exp2(m_i - m_ij, flush_to_zero=True) + l_i = l_i * alpha + l_ij + acc = acc * alpha + + v_tile = 
ct.load( + V, + index=(batch_idx, head_idx, j, 0), + shape=(1, 1, TILE_N, TILE_D), + latency=4 + ) + v_tile = v_tile.reshape((TILE_N, TILE_D)) + p_cast = p.astype(Q.dtype) + acc = ct.mma(p_cast, v_tile, acc) + m_i = m_ij + + acc = ct.truediv(acc, l_i, flush_to_zero=True, rounding_mode=RMd.APPROX) + acc = acc.reshape((1, 1, TILE_M, TILE_D)).astype(Out.dtype) + ct.store(Out, index=(batch_idx, head_idx, bid_x, 0), tile=acc) + + @ct.kernel(occupancy=1) + def fmha_full_opt_occ1( + Q, K, V, Out, + qk_scale: float, + TILE_D: ConstInt, + H: ConstInt, + TILE_M: ConstInt, + TILE_N: ConstInt, + CAUSAL: ConstBool, + ): + """Step 4b: Full optimization with occupancy=1 (for sm100 Blackwell)""" + bid_x = ct.bid(0) + bid_y = ct.bid(1) + batch_idx = bid_y // H + head_idx = bid_y % H + + qk_scale_log2 = qk_scale * INV_LOG_2 + + offs_m = bid_x * TILE_M + ct.arange(TILE_M, dtype=ct.int32) + offs_m = offs_m[:, None] + offs_n_tile = ct.arange(TILE_N, dtype=ct.int32) + offs_n_tile = offs_n_tile[None, :] + + m_i = ct.full((TILE_M, 1), -math.inf, dtype=ct.float32) + l_i = ct.full((TILE_M, 1), 0.0, dtype=ct.float32) + acc = ct.full((TILE_M, TILE_D), 0.0, dtype=ct.float32) + + q = ct.load(Q, index=(batch_idx, head_idx, bid_x, 0), shape=(1, 1, TILE_M, TILE_D)) + q = q.reshape((TILE_M, TILE_D)) + + k_seqlen = K.shape[2] + if CAUSAL: + m_end = (bid_x + 1) * TILE_M + Tc = ct.cdiv(min(m_end, k_seqlen), TILE_N) + mask_start = (bid_x * TILE_M) // TILE_N + else: + Tc = ct.cdiv(k_seqlen, TILE_N) + mask_start = k_seqlen // TILE_N + + for j in range(0, Tc): + k_t = ct.load( + K, + index=(batch_idx, head_idx, 0, j), + shape=(1, 1, TILE_D, TILE_N), + order=(0, 1, 3, 2), + latency=2 + ) + k_t = k_t.reshape((TILE_D, TILE_N)) + + qk = ct.full((TILE_M, TILE_N), 0.0, dtype=ct.float32) + qk = ct.mma(q, k_t, qk) + + if CAUSAL and j >= mask_start: + offs_n = j * TILE_N + offs_n_tile + mask = offs_m >= offs_n + qk = ct.where(mask, qk, ct.full((TILE_M, TILE_N), -math.inf, dtype=ct.float32)) + + m_ij = ct.max(qk, 
axis=-1, keepdims=True) * qk_scale_log2 + m_ij = ct.maximum(m_i, m_ij) + qk = qk * qk_scale_log2 - m_ij + + p = ct.exp2(qk, flush_to_zero=True) + l_ij = ct.sum(p, axis=-1, keepdims=True) + alpha = ct.exp2(m_i - m_ij, flush_to_zero=True) + l_i = l_i * alpha + l_ij + acc = acc * alpha + + v_tile = ct.load( + V, + index=(batch_idx, head_idx, j, 0), + shape=(1, 1, TILE_N, TILE_D), + latency=4 + ) + v_tile = v_tile.reshape((TILE_N, TILE_D)) + p_cast = p.astype(Q.dtype) + acc = ct.mma(p_cast, v_tile, acc) + m_i = m_ij + + acc = ct.truediv(acc, l_i, flush_to_zero=True, rounding_mode=RMd.APPROX) + acc = acc.reshape((1, 1, TILE_M, TILE_D)).astype(Out.dtype) + ct.store(Out, index=(batch_idx, head_idx, bid_x, 0), tile=acc) + + def run_kernel(kernel_fn, q, k, v, sm_scale, tile_m, tile_n, is_causal=True): + batch_size, num_heads, seq_len, head_dim = q.shape + o = torch.empty_like(q) + grid = (math.ceil(seq_len / tile_m), batch_size * num_heads, 1) + ct.launch( + torch.cuda.current_stream(), + grid, + kernel_fn, + (q, k, v, o, sm_scale, head_dim, num_heads, tile_m, tile_n, is_causal) + ) + return o + +def run_tilegym_fmha(q, k, v, sm_scale, is_causal=True): + try: + import tilegym + return tilegym.ops.fmha(q, k, v, scaling=sm_scale, is_causal=is_causal, backend="cutile") + except ImportError: + return None + + +def run_scaling_analysis(iterations=100): + device = get_device() + gpu_cap = torch.cuda.get_device_capability() + gpu_name = torch.cuda.get_device_name() + + configs = get_fmha_config() + primary_cfg = configs[0] + TILE_M = primary_cfg.TILE_M + TILE_N = primary_cfg.TILE_N + + logger.section("FMHA SCALING ANALYSIS (TileGym Benchmark Match)") + logger.log("Matching TileGym bench_fused_attention.py configuration") + logger.log(f"\nGPU: {gpu_name} (sm_{gpu_cap[0]}{gpu_cap[1]})") + + logger.subsection("TARGET-SPECIFIC CONFIGURATION (from TileGym)") + for cfg in configs: + logger.log(f"\n {cfg.name}:") + logger.log(f" TILE_M={cfg.TILE_M}, TILE_N={cfg.TILE_N}") + logger.log(f" 
num_ctas={cfg.num_ctas}, occupancy={cfg.occupancy}") + + logger.log(f"\nUsing primary config: TILE_M={TILE_M}, TILE_N={TILE_N}, occupancy={primary_cfg.occupancy}") + logger.log("\nTest Configuration (matches TileGym bench_fused_attention.py):") + logger.log(f" Batch: {BATCH}, Heads: {N_HEADS}, Head Dim: {HEAD_DIM}") + logger.log(" Causal: True, Precision: float16") + logger.log(f" Sequence Lengths: {SEQ_LENS}") + logger.log(" Benchmark: triton.testing.do_bench_cudagraph (same as TileGym)") + + logger.section("OPTIMIZATION STEPS") + logger.log(f""" +Step 0: PyTorch Baseline + - torch.nn.functional.scaled_dot_product_attention + - Uses cuDNN Flash Attention backend + - Highly optimized reference + +Step 1: Basic cuTile (TILE_M={TILE_M}, TILE_N={TILE_N}) + - @ct.kernel with ct.mma() for Tensor Cores + - Standard exp() for softmax + - Explicit transpose with ct.permute() + - No memory/occupancy hints + +Step 2: Math Optimizations + - ct.exp2() instead of ct.exp() (faster on GPU) + - flush_to_zero=True for denormals + - Scale adjustment: multiply by 1/log(2) + +Step 3: Memory Optimizations + - Load order=(0,1,3,2) for implicit K transpose + - Latency hints: K=2, V=4 for prefetching + - Overlaps memory loads with computation + +Step 4: Full Optimization (Target-Specific) + - @ct.kernel(occupancy={primary_cfg.occupancy}) for {'sm120/121' if primary_cfg.occupancy == 2 else 'sm100'} + - ct.truediv with APPROX rounding mode + - Matches TileGym production implementation +""") + + logger.section("PLATFORM DIFFERENCES: DGX Spark vs Blackwell B300") + logger.log(""" +| Parameter | DGX Spark (sm121) | Blackwell B300 (sm100) | +|--------------|-------------------|------------------------| +| TILE_M | 64 | 256 or 128 | +| TILE_N | 64 | 128 | +| num_ctas | 1 | 1 | +| occupancy | 2 | 1 or 2 | + +Why the difference? 
+- B300 has more SMs and larger shared memory -> can use bigger tiles +- B300 benefits from larger tiles (256x128) with lower occupancy +- DGX Spark needs smaller tiles (64x64) with higher occupancy to hide latency +- B300's higher memory bandwidth makes larger tiles more efficient +""") + + all_results = [] + + select_kernel = (fmha_full_opt_occ2 if primary_cfg.occupancy == 2 else fmha_full_opt_occ1) if CUTILE_AVAILABLE else None  # kernels are defined only when cuTile imports + + for seq_len in SEQ_LENS: + logger.subsection(f"Sequence Length: {seq_len}") + + q = torch.randn(BATCH, N_HEADS, seq_len, HEAD_DIM, dtype=torch.float16, device=device) + k = torch.randn(BATCH, N_HEADS, seq_len, HEAD_DIM, dtype=torch.float16, device=device) + v = torch.randn(BATCH, N_HEADS, seq_len, HEAD_DIM, dtype=torch.float16, device=device) + sm_scale = 1.0 / math.sqrt(HEAD_DIM) + + flops = compute_flops(BATCH, N_HEADS, seq_len, HEAD_DIM, causal=True) + + steps_results = [] + + baseline_fn = lambda: reference_fmha(q, k, v, sm_scale, is_causal=True) + baseline_latency = benchmark_fn(baseline_fn, warmup=10, iterations=iterations) + baseline_tflops = flops * 1e-12 / (baseline_latency * 1e-3) + steps_results.append(StepResult(0, "PyTorch Baseline", baseline_latency, baseline_tflops, 1.0)) + + if CUTILE_AVAILABLE: + kernels = [ + (1, "Basic cuTile", fmha_basic), + (2, "Math Opt (exp2)", fmha_math_opt), + (3, "Memory Opt (order+latency)", fmha_memory_opt), + (4, f"Full Opt (occ={primary_cfg.occupancy})", select_kernel), + ] + + for step, name, kernel in kernels: + try: + fn = lambda kernel=kernel: run_kernel(kernel, q, k, v, sm_scale, TILE_M, TILE_N, is_causal=True) + latency = benchmark_fn(fn, warmup=10, iterations=iterations) + tflops = flops * 1e-12 / (latency * 1e-3) + speedup = baseline_latency / latency + steps_results.append(StepResult(step, name, latency, tflops, speedup)) + except Exception as e: + logger.log(f" [ERROR] Step {step} failed: {e}") + + tilegym_latency = 0.0 + tilegym_tflops = 0.0 + tilegym_speedup = 0.0 + tilegym_out = run_tilegym_fmha(q, k, v, 
sm_scale, is_causal=True) + if tilegym_out is not None: + tilegym_fn = lambda: run_tilegym_fmha(q, k, v, sm_scale, is_causal=True) + tilegym_latency = benchmark_fn(tilegym_fn, warmup=10, iterations=iterations) + tilegym_tflops = flops * 1e-12 / (tilegym_latency * 1e-3) + tilegym_speedup = baseline_latency / tilegym_latency + + best_step = max(steps_results, key=lambda x: x.speedup_vs_baseline) + + result = SeqLenResult( + seq_len=seq_len, + steps=steps_results, + best_step=best_step.step, + best_speedup=best_step.speedup_vs_baseline, + tilegym_latency_ms=tilegym_latency, + tilegym_tflops=tilegym_tflops, + tilegym_speedup=tilegym_speedup, + ) + all_results.append(result) + + logger.log("\n | Step | Name | Latency (ms) | TFLOPS | Speedup |") + logger.log(" |------|------|--------------|--------|---------|") + + for sr in steps_results: + logger.log(f" | {sr.step} | {sr.name:<28} | {sr.latency_ms:>10.3f} | {sr.tflops:>6.2f} | {sr.speedup_vs_baseline:>6.2f}x |") + if tilegym_latency > 0: + logger.log(f" | TG | TileGym Reference | {tilegym_latency:>10.3f} | {tilegym_tflops:>6.2f} | {tilegym_speedup:>6.2f}x |") + + logger.log(f"\n Best: Step {best_step.step} ({best_step.name}) with {best_step.speedup_vs_baseline:.2f}x speedup") + + logger.results = all_results + return all_results + + +def print_summary(results: List[SeqLenResult]): + configs = get_fmha_config() + primary_cfg = configs[0] + + logger.section("SCALING SUMMARY") + + logger.log(f"\nTarget Config: TILE_M={primary_cfg.TILE_M}, TILE_N={primary_cfg.TILE_N}, occupancy={primary_cfg.occupancy}") + + logger.log("\n## Performance vs Sequence Length\n") + logger.log("| Seq Len | Baseline (ms) | Full Opt (ms) | Speedup | TileGym (ms) | TG Speedup |") + logger.log("|---------|---------------|---------------|---------|--------------|------------|") + for r in results: + baseline = next((s for s in r.steps if s.step == 0), None) + full_opt = next((s for s in r.steps if s.step == 4), None) + if baseline and full_opt: + 
logger.log(f"| {r.seq_len:>7} | {baseline.latency_ms:>13.3f} | {full_opt.latency_ms:>13.3f} | {full_opt.speedup_vs_baseline:>6.2f}x | {r.tilegym_latency_ms:>12.3f} | {r.tilegym_speedup:>9.2f}x |") + + logger.log("\n## Optimization Impact by Sequence Length\n") + logger.log("| Seq Len | Basic | +Math | +Memory | +Full | Best |") + logger.log("|---------|-------|-------|---------|-------|------|") + for r in results: + row = f"| {r.seq_len:>7} |" + for step in [1, 2, 3, 4]: + sr = next((s for s in r.steps if s.step == step), None) + if sr: + row += f" {sr.speedup_vs_baseline:>5.2f}x |" + else: + row += " N/A |" + row += f" {r.best_speedup:>4.2f}x |" + logger.log(row) + + logger.section("KEY INSIGHTS") + logger.log(""" +## Why Larger Sequences Benefit More from Optimization + +1. **Memory Bandwidth Dominance** + - Attention has O(N²) memory complexity for the QK^T matrix + - At seq_len=8192: 8192² × 4 bytes = 256MB per head per batch + - Memory optimizations (order, latency hints) have larger impact + +2. **More K-Loop Iterations** + - At seq_len=512: 8 K-tiles (512/64) for sm121, 4 K-tiles (512/128) for sm100 + - At seq_len=8192: 128 K-tiles for sm121, 64 K-tiles for sm100 + - Latency hiding through pipelining amortizes over more iterations + +3. **Better Occupancy Utilization** + - More tiles = more parallelism opportunities + - sm121 uses occupancy=2 (smaller tiles, more blocks) + - sm100 uses occupancy=1 with larger tiles (256x128) + +4.
**Platform-Specific Tuning** + - DGX Spark (sm121): 64x64 tiles, occupancy=2 - optimized for bandwidth-limited workloads + - B300 (sm100): 256x128 tiles, occupancy=1 - optimized for compute-heavy workloads + +## Optimization Priority by Problem Size + +Small (seq_len <= 1024): + - Basic cuTile often sufficient + - Focus on correctness first + +Medium (1024 < seq_len <= 4096): + - Math optimizations (exp2) provide ~5% gain + - Memory optimizations start to matter + +Large (seq_len > 4096): + - Full optimization stack critical + - Platform-specific tuning essential + - Memory pipelining becomes essential +""") + + +def export_results(results: List[SeqLenResult], output_dir: str): + configs = get_fmha_config() + primary_cfg = configs[0] + + data = { + "config": { + "batch": BATCH, + "n_heads": N_HEADS, + "head_dim": HEAD_DIM, + "tile_m": primary_cfg.TILE_M, + "tile_n": primary_cfg.TILE_N, + "occupancy": primary_cfg.occupancy, + "num_ctas": primary_cfg.num_ctas, + "platform": primary_cfg.name, + }, + "results": [ + { + "seq_len": r.seq_len, + "steps": [asdict(s) for s in r.steps], + "best_step": r.best_step, + "best_speedup": r.best_speedup, + "tilegym_latency_ms": r.tilegym_latency_ms, + "tilegym_tflops": r.tilegym_tflops, + "tilegym_speedup": r.tilegym_speedup, + } + for r in results + ] + } + + json_path = f"{output_dir}/fmha_scaling_results.json" + with open(json_path, 'w') as f: + json.dump(data, f, indent=2) + + md_path = f"{output_dir}/fmha_scaling_results.md" + with open(md_path, 'w') as f: + f.write("# FMHA Scaling Analysis Results\n\n") + f.write("## Configuration\n") + f.write(f"- Platform: {primary_cfg.name}\n") + f.write(f"- Batch: {BATCH}, Heads: {N_HEADS}, Head Dim: {HEAD_DIM}\n") + f.write(f"- Tile: {primary_cfg.TILE_M}x{primary_cfg.TILE_N}, occupancy={primary_cfg.occupancy}\n\n") + + f.write("## Target-Specific Configs (from TileGym)\n\n") + f.write("| Platform | TILE_M | TILE_N | num_ctas | occupancy |\n") + 
f.write("|----------|--------|--------|----------|----------|\n") + f.write("| DGX Spark (sm121) | 64 | 64 | 1 | 2 |\n") + f.write("| B300 (sm100) Config 1 | 256 | 128 | 1 | 1 |\n") + f.write("| B300 (sm100) Config 2 | 128 | 128 | 1 | 2 |\n\n") + + f.write("## Results by Sequence Length\n\n") + for r in results: + f.write(f"### Seq Len = {r.seq_len}\n\n") + f.write("| Step | Name | Latency (ms) | TFLOPS | Speedup |\n") + f.write("|------|------|--------------|--------|--------|\n") + for s in r.steps: + f.write(f"| {s.step} | {s.name} | {s.latency_ms:.3f} | {s.tflops:.2f} | {s.speedup_vs_baseline:.2f}x |\n") + if r.tilegym_latency_ms > 0: + f.write(f"| TG | TileGym Reference | {r.tilegym_latency_ms:.3f} | {r.tilegym_tflops:.2f} | {r.tilegym_speedup:.2f}x |\n") + f.write("\n") + + log_path = f"{output_dir}/fmha_scaling_log.txt" + with open(log_path, 'w') as f: + f.write('\n'.join(logger.logs)) + + logger.log("\nResults exported to:") + logger.log(f" - {json_path}") + logger.log(f" - {md_path}") + logger.log(f" - {log_path}") + + +def main(): + parser = argparse.ArgumentParser(description="FMHA Scaling Analysis") + parser.add_argument("--iterations", type=int, default=100, help="Benchmark iterations") + parser.add_argument("--output-dir", type=str, default=".", help="Output directory") + args = parser.parse_args() + + results = run_scaling_analysis(iterations=args.iterations) + print_summary(results) + export_results(results, args.output_dir) + + +if __name__ == "__main__": + main() diff --git a/nvidia/nemoclaw-applications/README.md b/nvidia/nemoclaw-applications/README.md new file mode 100644 index 0000000..fcd0134 --- /dev/null +++ b/nvidia/nemoclaw-applications/README.md @@ -0,0 +1,1412 @@ +# 🦞 Set Up Example NemoClaw Agents 🦞 + +> Ready-to-run application examples for your NemoClaw sandbox - policy, prompt, and personalization for each workflow + + +## Table of Contents + +- [Overview](#overview) +- [Daily Personal News Digest](#daily-personal-news-digest) +-
[Software Development Agent](#software-development-agent) + - [Requested features](#requested-features) + - [Project context](#project-context) + - [Execution plan](#execution-plan) + - [Implementation summary](#implementation-summary) + - [Self-review](#self-review) + - [Test results](#test-results) + - [Open questions for the human](#open-questions-for-the-human) +- [Deck Reviewer](#deck-reviewer) + - [Create the red-team working directory](#create-the-red-team-working-directory) + - [Bind the red-team directory into the sandbox](#bind-the-red-team-directory-into-the-sandbox) + - [CRITICAL](#critical) + - [HIGH](#high) + - [MEDIUM](#medium) + - [NICE-TO-FIX](#nice-to-fix) + - [Dismissed (active, not re-flagged)](#dismissed-active-not-re-flagged) + - [Open questions for the human](#open-questions-for-the-human) +- [Calendar Negotiator](#calendar-negotiator) + - [Create the calendar working directory](#create-the-calendar-working-directory) + - [Bind the calendar directory into the sandbox](#bind-the-calendar-directory-into-the-sandbox) +- [NemoClaw Policy Setup](#nemoclaw-policy-setup) +- [Troubleshooting](#troubleshooting) + - [General sandbox & policy issues](#general-sandbox-policy-issues) + - [[NemoClaw Policy Setup](https://build.nvidia.com/spark/nemoclaw-applications/policy-setup)](#nemoclaw-policy-setuphttpsbuildnvidiacomsparknemoclaw-applicationspolicy-setup) + - [[Daily Personal News Digest](https://build.nvidia.com/spark/nemoclaw-applications/news-digest)](#daily-personal-news-digesthttpsbuildnvidiacomsparknemoclaw-applicationsnews-digest) + - [[Software Development Agent](https://build.nvidia.com/spark/nemoclaw-applications/developer-agent)](#software-development-agenthttpsbuildnvidiacomsparknemoclaw-applicationsdeveloper-agent) + - [[Deck Reviewer](https://build.nvidia.com/spark/nemoclaw-applications/deck-reviewer)](#deck-reviewerhttpsbuildnvidiacomsparknemoclaw-applicationsdeck-reviewer) + - [[Calendar 
Negotiator](https://build.nvidia.com/spark/nemoclaw-applications/calendar-negotiator)](#calendar-negotiatorhttpsbuildnvidiacomsparknemoclaw-applicationscalendar-negotiator) + +--- + +## Overview + +## Basic idea + +This playbook is a companion to the [NemoClaw on DGX Spark](https://build.nvidia.com/spark/nemoclaw) install playbook. It walks through **four ready-to-run applications** you can stand up on top of an existing NemoClaw sandbox - a personal morning news digest, a software development agent, a doc and deck red-team, and a calendar negotiation chief-of-staff. + +Each application is presented as a self-contained tab with the same three sections: + +- **Policy setup** - the exact NemoClaw / OpenShell sandbox policy changes the workflow needs (channels, network egress, filesystem mounts). +- **Agent prompt** - the full canonical prompt you copy-paste into the NemoClaw web UI or send to your Telegram bot. It defines the agent's complete behavior end-to-end and is the only configuration the workflow needs. +- **How to personalize** - the knobs to turn (paths, schedule, audience, persona) to adapt the recipe to your real use case. + +All applications run inside the **OpenShell sandbox** that NemoClaw created during onboarding, so the agent's filesystem, network, process, and inference access stays bounded by the policy you grant. + +## What you'll accomplish + +You will run four practical NemoClaw workflows on your DGX Spark: + +- **[Daily Personal News Digest](https://build.nvidia.com/spark/nemoclaw-applications/news-digest)** - a scheduled morning briefing that wakes up on a cron, sweeps the topics you care about across an allowlisted set of sources, and posts a structured digest (Top 3, headlines by topic, deep dive, skip-the-noise, on-your-radar, local) to your Telegram home channel.
+- **[Software Development Agent](https://build.nvidia.com/spark/nemoclaw-applications/developer-agent)** - reads a single project directory, builds an execution plan for the features you specify, implements them, reviews its own work, and writes a `develop-and-review.md` you can read before merging. No outbound network beyond the local inference endpoint. +- **[Deck Reviewer](https://build.nvidia.com/spark/nemoclaw-applications/deck-reviewer)** - a Doc & Deck Red-Team that scans the artifact you're about to send for inconsistent numbers, unsourced claims, missing data, accessibility issues, and prior-version contradictions, then returns a severity-ranked punch list with proposed edits. +- **[Calendar Negotiator](https://build.nvidia.com/spark/nemoclaw-applications/calendar-negotiator)** - a scheduling chief-of-staff that turns "when can we meet?" threads into a confirmed meeting on your calendar, respecting your focus blocks, energy patterns, and time-zone fairness with the other party. + +A separate **[NemoClaw Policy Setup](https://build.nvidia.com/spark/nemoclaw-applications/policy-setup)** tab covers the one-time Telegram channel wiring that two of the applications (News Digest and Calendar Negotiator) require and the other two (Software Development Agent and Deck Reviewer) can optionally use for "ready for review" notifications. The **Troubleshooting** tab collects symptom/cause/fix entries specific to these workflows. + +For each application you will be able to read the live policy YAML (`openshell policy get --full`), apply or remove maintained presets with `nemoclaw policy-add` / `policy-remove` (no rebuild required for network changes), and bind host directories into the sandbox with `nemoclaw share mount` (hot - no rebuild required for mounts either).
Tightening `filesystem_policy` itself, when you want a kernel-enforced write boundary inside the sandbox, is the only step that still requires `nemoclaw rebuild` (workspace state is preserved automatically). + +## What to know before starting + +- You have completed the [NemoClaw on DGX Spark](https://build.nvidia.com/spark/nemoclaw) playbook and have a working sandbox (the examples use `my-assistant`). +- Basic comfort with the Linux terminal and YAML files. +- Awareness of the agent risk surface - see the *Important: security and risks* section in the NemoClaw overview. + +## Prerequisites + +**Hardware and access:** + +- A DGX Spark (GB10) with a working NemoClaw install (see [NemoClaw on DGX Spark](https://build.nvidia.com/spark/nemoclaw)). +- A running OpenShell gateway and a sandbox created by the NemoClaw onboard wizard (`nemoclaw list` shows at least one sandbox). +- A Telegram bot wired into the sandbox at onboard time for the **Daily Personal News Digest** and **Calendar Negotiator** applications. If you skipped Telegram during onboard, re-run the NemoClaw installer to recreate the sandbox with Telegram enabled. See **[NemoClaw Policy Setup](https://build.nvidia.com/spark/nemoclaw-applications/policy-setup)** for the one-time wiring steps. + +**Software:** + +- Ollama serving the model you selected during NemoClaw onboard (Nemotron 3 Super 120B in the install playbook). +- A working public webhook tunnel (`nemoclaw tunnel start`) for any Telegram-driven application. + +Verify the sandbox is healthy before you start: + +```bash +nemoclaw list +nemoclaw my-assistant status +``` + +Expected: your sandbox appears in the list and `status` reports the sandbox as **Running** with the inference provider pointing at your local Ollama model. + +## Have ready before you begin + +| Item | Where to get it | Used by | +|------|----------------|---------| +| Sandbox name from NemoClaw onboard (e.g.
`my-assistant`) | `nemoclaw list` | All applications | +| Telegram bot token and numeric user ID | [@BotFather](https://t.me/BotFather) (`/newbot`), `@userinfobot` on Telegram for your user ID | [Policy Setup](https://build.nvidia.com/spark/nemoclaw-applications/policy-setup), [News Digest](https://build.nvidia.com/spark/nemoclaw-applications/news-digest), [Calendar Negotiator](https://build.nvidia.com/spark/nemoclaw-applications/calendar-negotiator); optional for [Software Development Agent](https://build.nvidia.com/spark/nemoclaw-applications/developer-agent) and [Deck Reviewer](https://build.nvidia.com/spark/nemoclaw-applications/deck-reviewer) | +| Allowlist of news source hostnames to add under `network_policies` | Pick the sites you trust | [News Digest](https://build.nvidia.com/spark/nemoclaw-applications/news-digest) | +| A host directory containing the project you want built and reviewed | A copy/clone of the project, e.g. `~/nemoclaw-projects/my-app/` | [Software Development Agent](https://build.nvidia.com/spark/nemoclaw-applications/developer-agent) | +| A queue folder, a canonical corpus folder, and a `profile.yaml` for red-team rules | Curate from prior decks, brand guide, and canonical metric files, e.g. `~/nemoclaw-redteam/` | [Deck Reviewer](https://build.nvidia.com/spark/nemoclaw-applications/deck-reviewer) | +| A `calendar.ics` export and a `profile.yaml` with working hours, focus blocks, and timezone | Export from your real calendar (Google: *Settings → Import & export*) into `~/nemoclaw-calendar/` | [Calendar Negotiator](https://build.nvidia.com/spark/nemoclaw-applications/calendar-negotiator) | + +## Ancillary files + +All policy snippets and example prompts in this playbook are inline in the application tabs - there are no external assets to clone. The bundled sandbox policy is shipped with NemoClaw and OpenShell; the application tabs only **modify** it.
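
The "Have ready before you begin" checklist can be turned into a quick pre-flight check on the host. The sketch below is illustrative only: `missing_items` is a hypothetical helper (not part of NemoClaw or OpenShell), the paths are the example locations from the table, and `profile.yaml` is assumed to sit at the top level of the red-team and calendar folders. Adjust the paths to wherever you actually keep these files.

```python
from pathlib import Path

# Example locations taken from the checklist table. They are assumptions
# about your layout, not paths that NemoClaw itself requires.
EXPECTED = {
    "Software Development Agent": ["nemoclaw-projects/my-app"],
    "Deck Reviewer": ["nemoclaw-redteam/profile.yaml"],
    "Calendar Negotiator": [
        "nemoclaw-calendar/calendar.ics",
        "nemoclaw-calendar/profile.yaml",
    ],
}


def missing_items(home: Path) -> dict[str, list[str]]:
    """Map each application to the expected paths that do not exist under `home`."""
    report = {}
    for app, rel_paths in EXPECTED.items():
        absent = [rel for rel in rel_paths if not (home / rel).exists()]
        if absent:
            report[app] = absent
    return report


if __name__ == "__main__":
    # Print one line per application that still has something missing.
    for app, paths in missing_items(Path.home()).items():
        print(f"{app}: missing {', '.join(paths)}")
```

An empty result only means the files exist; it says nothing about whether their contents are what the agent prompts expect.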
+ +## Time and risk + +- **Estimated time:** 30–45 minutes to walk through all four applications. Each application individually takes 5–10 minutes once the prerequisites are in place. Plan an extra 10 minutes for the one-time [Policy Setup](https://build.nvidia.com/spark/nemoclaw-applications/policy-setup) tab if you have not enabled Telegram yet. +- **Risk level:** **Medium.** Every application grants the agent additional capability beyond the default sandbox - outbound network for the news digest, filesystem access for code review, deck red-team, and calendar negotiation. Risk is reduced by tight per-application policies (host-level `chmod` on read-only source data backed by `share mount`'s SSHFS permission passthrough, scoped sandbox directories so the agent only sees one mounted tree at a time, explicit egress allowlists via `nemoclaw policy-add` presets, and in-prompt safety rules that survive single-message overrides) but is not eliminated. **Do not point these recipes at sensitive data, production accounts, or personal files** without reviewing the policy first. +- **Rollback:** Each application tab includes a rollback section that either reverts the policy (network changes are hot-reloadable) or destroys and recreates the sandbox with the original policy. The [Troubleshooting](https://build.nvidia.com/spark/nemoclaw-applications/troubleshooting) tab covers common stuck-state recovery. You can always run `nemoclaw uninstall` to remove everything. +- **Last Updated:** 06/01/2026 + - Sync up to latest nemoclaw/openshell policy APIs + +## Daily Personal News Digest + +## Daily Personal News Digest + +This is a cron-style workflow: the agent wakes up on a schedule, fetches updates from a small allowlist of URLs, summarizes them, and posts a digest to your Telegram home channel. + +## Step 1.
Policy setup + +Start from the [NemoClaw Policy Setup](https://build.nvidia.com/spark/nemoclaw-applications/policy-setup) tab's working Telegram channel (channel plugin + `api.telegram.org` egress). Then **add network egress for the sources you want the agent to read** by applying a small custom preset with `nemoclaw policy-add --from-file`. The preset is additive and hot-reloads - you do **not** need to dump or round-trip the full live policy. + +Create `news-sources.yaml`: + +```yaml +preset: + name: news-sources + description: "Daily news digest source allowlist" + +network_policies: + news-sources: + name: news-sources + endpoints: + - host: developer.nvidia.com + port: 443 + access: full + tls: skip + - host: blogs.nvidia.com + port: 443 + access: full + tls: skip + - host: news.ycombinator.com + port: 443 + access: full + tls: skip + binaries: + - { path: /usr/local/bin/openclaw } + - { path: /usr/local/bin/node } + - { path: /usr/bin/node } + - { path: /usr/bin/curl } +``` + +`network_policies` is a **map** keyed by group name (here, `news-sources`); each group has its own `name` and an `endpoints` list. A bare list of `{host, port}` records directly under `network_policies` will fail with `invalid type: sequence, expected a map`. + +> [!IMPORTANT] +> Both `preset.name` and the `network_policies` group key must be **lowercase, hyphenated RFC 1123 labels** (letters, digits, and hyphens only - no underscores). Using `news_sources` fails with `Preset must declare preset.name (lowercase, hyphenated RFC 1123 label)`. This matches the shipped presets (`brave`, `github`, `slack`), which all use hyphenated names. + +> [!IMPORTANT] +> Each endpoint needs **two things beyond `host`/`port`, or the egress proxy denies the connection with `curl: (56) CONNECT tunnel failed, response 403`** even though the host shows up in the live policy: +> 1. An **access mode**.
The simplest for fetching web pages is a raw pass-through tunnel - `access: full` with `tls: skip` (the same shape the shipped `whatsapp`/`brew` presets use). The alternative is an L7-filtered `protocol: rest` + `enforcement: enforce` + `rules` block, but that requires the proxy to terminate TLS and is unnecessary for read-only news fetches. +> 2. A **`binaries`** allow-list naming which programs may use this egress. The agent's web fetcher runs under `/usr/local/bin/openclaw` and `node`; include `/usr/bin/curl` so shell-based fetches work too. Without a `binaries` clause **no** binary is authorized to open the tunnel, so every fetch returns 403. +> +> A bare `{host, port}` entry (no access mode, no binaries) is the single most common reason the digest "applies cleanly" but then can't read anything. + +Apply the preset (hot-reload, no sandbox restart): + +```bash +nemoclaw $SANDBOX_NAME policy-add --from-file ./news-sources.yaml --yes +``` + +Confirm the new hosts are present: + +```bash +openshell policy get $SANDBOX_NAME --full | grep -E "host:|port:" +``` + +> [!TIP] +> Prefer `nemoclaw policy-add --from-file` over `openshell policy get --full > policy.yaml` followed by `openshell policy set`. The full-dump round trip in openshell `0.0.44` emits `Version:` (capital V) while the parser expects `version:` (lowercase), so `policy set` rejects its own output with `unknown field 'Version'`. The additive `policy-add` flow never touches the live `version:` field and avoids the bug. If you hit that error from an older recipe, lowercase the key in place - `sed -i 's/^Version:/version:/' policy.yaml` - and rerun `policy set`. + +## Step 2. Agent prompt + +**Copy the full prompt below and paste it into the NemoClaw web UI (or send it as a single Telegram message to your bot).** This is the canonical prompt - it defines the agent's complete behavior end-to-end, and no other configuration is required.
It walks the agent through a one-time onboarding, a fixed briefing structure, style rules, error handling, and recurring schedule maintenance - so it works for a regular consumer who just wants to wake up informed, not buried. + +```text +You are my personal news intelligence analyst. Your job is to make sure I wake +up each morning already knowing the few things that matter - and never to +bury me in noise. + +ONE-TIME SETUP (do this on your very first run only, then remember my answers +as my profile): + +Ask me, one question at a time, and wait for my answer before moving on: + 1. What's on your news menu? Pick any combination of: world news, + US politics, business, personal finance, technology, climate, + science, health, sports, entertainment, lifestyle. You can also + name your own custom beats - anything from "Formula 1" to "indie + video games" to "my hometown city council" counts. + 2. Who should I sound like when I write to you? Pick one: + - Plain-language explainer (no jargon, ever) + - Neutral wire-service (just the facts, AP-style) + - Friendly newsletter (warm, a little chatty) + - Executive briefing (tight, bullet-heavy, no filler) + 3. How much time do you give me with your coffee? 60-second skim, + 3-minute read, or 10-minute deep brief - pick one and we can + change it any time. + 4. Any VIPs or villains? Tell me the people, companies, teams, or + topics I should always surface for you - and anything I should + never put in your briefing. + 5. Where are you waking up? Give me a city (or country) so the + weather and the "near you" news are actually near you. + 6. When's showtime? Default is 08:00 America/Los_Angeles every + weekday. Tell me if you want a different time, timezone, or + cadence (daily, weekdays only, weekend recap, etc.). + +Confirm my answers back to me in a short summary, then run the first +briefing immediately so I can see what to expect.
+ +DAILY BRIEFING STRUCTURE (use this exact shape every run, in this order): + + 1. Top 3 - the three stories I cannot miss today. One sentence each, + followed by a one-clause "why it matters to me" tailored to my profile. + 2. Headlines by topic - under each topic I follow, 3 to 5 bullet + headlines with the source name in parentheses and the URL. + 3. Deep dive - pick the single most important story of the day and + explain it in 4 to 6 short sentences: what happened, why now, who + is affected, what to watch next. + 4. Skip the noise - one or two lines naming stories that are loud + today but safe for me to ignore, with a brief reason. + 5. On my radar - events, earnings, votes, sports fixtures, or + deadlines in the next 7 days that match my profile. + 6. Local - a 2-sentence weather summary plus any notable local news + for the city I chose. + +STYLE RULES: + - Plain language; assume I am not an expert in any topic. + - No hype words ("shocking", "you won't believe", "breaking"). Just + the facts. + - Cite every claim with the source name and a working URL. + - Never invent quotes, numbers, dates, or events. If you cannot + verify a detail, omit it or label it clearly as "unconfirmed". + - Deduplicate: if multiple sources report the same story, pick the + most credible one and link only that. + - Respect my length preference. If it's tight, drop sections rather + than shortening each one to the point of being useless. + +ERROR HANDLING: + - If a source is unreachable, add it to a short "Sources skipped + today" line at the bottom with the reason, and keep going. + - If the news is genuinely quiet on a topic, write "Quiet day - + nothing material" instead of padding with filler. + - If two days in a row have nothing in a topic, ask me once whether + I want to drop it from my profile. + +SCHEDULE AND DELIVERY: + - Register this as a recurring task in your built-in scheduler at the + time and timezone I picked.
Confirm the next 3 trigger times back + to me after onboarding. + - Deliver each briefing to my Telegram home channel. + - Skip US public holidays unless a major breaking story is unfolding. + +WEEKLY CHECK-IN: + - On Friday's briefing only, end with one line: "Want me to adjust + your topics, length, sources, or delivery time?" If I reply, update + my profile and confirm the change. + +Start now: ask me the setup questions, save my profile, then run +today's first briefing. +``` + +Expected: the agent confirms it has scheduled a task. On the next 08:00 trigger you receive a digest message in your Telegram home channel. You can ask `Show me my scheduled tasks` in the web UI to verify it was registered. + +Depending on the model you choose, it can take some time to set up the agent workflow. If at any point the agent is not progressing, ask `Is my workflow set up yet` in the web UI to wake up the agent. + +> [!NOTE] +> **Running without Telegram (web-UI delivery).** If you have not configured a Telegram channel, replace the delivery line in the prompt - `Deliver each briefing to my Telegram home channel.` - with `Deliver each briefing to the web UI (this session). Do not use any messaging channel.` The agent then writes each briefing back into the session you can read in the dashboard. Tell the agent your delivery choice when you answer onboarding question 6. (See also the **Delivery channel** row in Step 3.) + +> [!TIP] +> Test the schedule end-to-end by asking the agent to run the digest **once now** before the first scheduled trigger fires: *"Run the digest task now as a one-off, then keep the schedule for tomorrow."* This one-off runs through the **live** agent and is the most reliable end-to-end check (it produces a real briefing immediately).
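
The prompt asks the agent to confirm its next three trigger times, so it helps to know what those times should be before trusting the reply. A minimal sketch, assuming the prompt's default of 08:00 on weekdays and ignoring the holiday-skip rule; `next_triggers` is a hypothetical helper for checking by hand, not a NemoClaw or OpenClaw API.

```python
from datetime import datetime, time, timedelta


def next_triggers(now, count=3, at=time(8, 0)):
    """Return the next `count` Monday-Friday trigger times strictly after `now`.

    `now` must be timezone-aware; results stay in the same timezone.
    """
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    triggers = []
    day = now.date()
    while len(triggers) < count:
        candidate = datetime.combine(day, at, tzinfo=now.tzinfo)
        # weekday() is 0-4 for Monday-Friday and 5-6 for the weekend.
        if candidate.weekday() < 5 and candidate > now:
            triggers.append(candidate)
        day += timedelta(days=1)
    return triggers
```

Pass an aware datetime in your delivery timezone, for example `datetime.now(ZoneInfo("America/Los_Angeles"))` with `ZoneInfo` from the standard-library `zoneinfo` module, and compare the result with the times the agent reports.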
+ +> [!IMPORTANT] +> **Register the schedule from the operator side - don't rely on the agent's tool call.** When the agent runs as an embedded `openclaw agent` turn (the headless path used here), its in-turn cron tool connects to the gateway with a device token that lacks the scheduler scope, so the registration is rejected with `scope upgrade pending approval … pairing required: device is asking for more scopes than currently approved`. The agent then reports it "has no built-in scheduler" or that the scheduler is "flapping." Register the recurring job yourself instead - this is verified to work: +> +> ```bash +> nemoclaw $SANDBOX_NAME exec -- openclaw cron add \ +> --name news-digest --cron "0 8 * * 1-5" --tz America/Los_Angeles \ +> --agent default --session-key agent:default:news-digest \ +> --message "Run my daily news briefing now and write it to this session." \ +> --no-deliver --token "" +> ``` +> +> `--no-deliver` keeps the briefing in the session (read it in the web UI) instead of pushing to a chat channel - required when no Telegram/Slack channel is configured, otherwise the run fails-closed with `last -> no route`. Confirm with `nemoclaw $SANDBOX_NAME exec -- openclaw cron list` and `... openclaw cron status`. (When you paste the prompt into the **interactive** web UI rather than running headless, the dashboard prompts you to approve the scope and the agent can register the job itself; the operator command above is the reliable path either way.) + +> [!IMPORTANT] +> **Scheduled triggers on a local model (vLLM).** Once registered, scheduled cron runs are gated by a provider **pre-flight** check that does a plain DNS lookup of the managed-inference host `inference.local`. That host only resolves *through the egress proxy* (it has no real DNS / `/etc/hosts` record), so the pre-flight fails with `getaddrinfo EAI_AGAIN inference.local` and the run is logged as `skipped`.
Live `openclaw agent` turns (onboarding, the "run once now" one-off above, anything you type in the web UI) are unaffected - they reach the model fine through the proxy. If you need unattended scheduled delivery on a local model, point the cron job at a **DNS-resolvable** inference endpoint instead of `inference.local` (the `local-inference` preset already allows the host's vLLM at `host.openshell.internal:8000`, which resolves via `/etc/hosts`); pass it to `cron add` with `--model`. Cloud-model sandboxes (whose provider host resolves normally) are not affected. + +## Step 3. How to personalize + +| Knob | Where | What to change | +|------|-------|----------------| +| **Schedule** | `openclaw cron add` (operator command in Step 2) | Change the `--cron "0 8 * * 1-5"` expression and `--tz` in the registration command (`0 9 * * 1` = Mondays 09:00, `0 */6 * * *` = every 6 hours, etc.). Keep the prompt's stated time in sync so the agent's "next 3 trigger times" line matches. | +| **Sources** | `news-sources.yaml` **and** the prompt | Add the host as a new entry under `network_policies.news-sources.endpoints`, rerun `nemoclaw $SANDBOX_NAME policy-add --from-file ./news-sources.yaml --yes`, then list the URL in the prompt. The sandbox blocks any fetch to a host that is not in the allowlist. | +| **Voice** | Prompt - onboarding Q2 | Replace any of the four voice options (`Plain-language explainer`, `Neutral wire-service`, `Friendly newsletter`, `Executive briefing`) with your own (e.g., `Calm dad voice`, `Skeptical analyst`, `Snarky finance bro`). | +| **Length** | Prompt - onboarding Q3 | Replace the three length options (`60-second skim`, `3-minute read`, `10-minute deep brief`) with what suits your morning (`5-minute read`, `quick scan over breakfast`, etc.). | +| **Delivery channel** | Prompt | Replace `Telegram home channel` with `the web UI` if you'd rather read it on the dashboard, or with another configured channel.
|
+| **Filtering** | Prompt | Add `Only include posts that mention "Spark" or "GB10".` to focus the digest. |
+
+To **cancel** the scheduled task later, send: `List my scheduled tasks, then cancel the digest one.`
+
+## Software Development Agent
+
+The agent reads a single project directory, builds an execution plan for the features you specify, implements the features, reviews the implementation, and writes a `develop-and-review.md` back into the same directory. It needs no outbound network access beyond the local inference endpoint.
+
+> [!WARNING]
+> Read-write filesystem access lets the agent modify files in the mounted directory. **Point it at a project copy or a clean clone, not your only working tree.** Commit or back up before granting write access.
+
+## Step 1. Expose the project to the sandbox
+
+Make a working copy of the project the agent will plan, build, and review against. Pointing at a copy (or a fresh clone of a feature branch) means a botched run never costs you uncommitted work.
+
+```bash
+mkdir -p ~/nemoclaw-projects
+cp -r ~/projects/my-app ~/nemoclaw-projects/my-app
+```
+
+Now copy that working copy **into** the sandbox at `/sandbox/project`. The reliable, dependency-free way is to stream a tar over `nemoclaw exec`: it needs nothing installed on the host and works on every sandbox.
+
+```bash
+## Push the project into the sandbox
+tar czf - -C ~/nemoclaw-projects/my-app . \
+  | nemoclaw $SANDBOX_NAME exec -- bash -lc 'mkdir -p /sandbox/project && tar xzf - -C /sandbox/project'
+```
+
+Confirm the project landed and that the sandbox cannot reach the public internet (the local inference endpoint stays available regardless, because that is how the agent talks to the model):
+
+```bash
+nemoclaw $SANDBOX_NAME exec -- ls /sandbox/project # expect your project tree
+nemoclaw $SANDBOX_NAME exec -- bash -lc 'curl -sS --max-time 5 https://example.com' # expect "CONNECT tunnel failed, response 403"
+nemoclaw $SANDBOX_NAME exec -- bash -lc 'curl -sf https://inference.local/v1/models' # expect JSON model list
+```
+
+Expected: the `ls` shows your project tree, `example.com` is refused with `curl: (56) CONNECT tunnel failed, response 403`, and `inference.local` returns the model list. If `example.com` succeeds, the sandbox has unintended egress: run `nemoclaw $SANDBOX_NAME policy-list` and remove anything you don't need with `nemoclaw $SANDBOX_NAME policy-remove `.
+
+After the agent finishes (Step 2), pull the results, including the report, back to your host copy the same way:
+
+```bash
+## Pull the project (with the agent's edits + develop-and-review.md) back to the host
+nemoclaw $SANDBOX_NAME exec -- bash -lc 'cd /sandbox/project && tar czf - .' | tar xzf - -C ~/nemoclaw-projects/my-app
+```
+
+> [!NOTE]
+> **`nemoclaw share mount` is the *opposite* direction and is optional.** `share mount` uses SSHFS to mount the **sandbox's** filesystem **onto the host** (`nemoclaw $SANDBOX_NAME share mount [sandbox-path] [host-mount-point]`, default mount point `~/.nemoclaw/mounts/`). It does **not** push host files into the sandbox, so it cannot replace the `tar` push above.
It is only useful for *live-editing* sandbox files from your host editor, and it requires `sshfs` on the host:
+> ```bash
+> sudo apt-get install -y sshfs # needs root; or: sudo dnf install fuse-sshfs
+> nemoclaw $SANDBOX_NAME share mount /sandbox/project ~/nemoclaw-projects/my-app-live
+> ```
+> If `sshfs` is not installed (`share mount` prints `sshfs is not installed`) and you cannot install it (no root), skip `share mount` entirely and use the `tar` push/pull above; they cover the whole workflow without it. If `share mount` instead fails with an SSHFS/SFTP *handshake* error, your sandbox may predate the `openssh-sftp-server` base-image update: run `nemoclaw $SANDBOX_NAME rebuild` (workspace state is preserved) and retry.
+
+## Step 2. Agent prompt
+
+**Copy the full prompt below and paste it into the NemoClaw web UI, the sandbox shell, or a single Telegram message to your bot.** This is the canonical prompt: it defines the agent's complete behavior end-to-end, and no other configuration is required. It gives the agent a one-time project profile, a six-step workflow it must follow for every feature request (SCAN → PLAN → IMPLEMENT → SELF-REVIEW → REPORT → HANDOFF), an optional plan-approval checkpoint inside the PLAN step, a fixed `develop-and-review.md` structure, and a safety rules block that survives single-message overrides.
+
+```text
+You are my senior software engineer. The project lives at /sandbox/project.
+Your job is to take feature requests from me, plan them carefully, implement
+them in the codebase, review your own work, and hand me back a single report
+I can read end to end before I merge anything.
+
+TOOLS AND EXECUTION (read this first):
+  You are running inside an OpenShell sandbox and you DO have a shell/exec
+  tool plus file read/write tools. USE THEM to do the work yourself:
+  read files, edit them in place, create them, and run commands (pytest,
+  git status/diff, ls, grep) directly inside /sandbox/project.
Actually
+  perform every change. Never hand me copy-paste code blocks and ask me
+  to apply them, and never claim you "have no file-write or exec tool."
+  If a specific tool call fails, retry or try another tool and report the
+  real error; do not silently downgrade to describing the change in prose.
+  Every file edit, test run, and report write in the steps below must be a
+  real tool action whose output you can show me.
+
+ONE-TIME SETUP (do this on your first run only, then remember my answers
+as my project profile):
+
+Ask me, one question at a time, and wait for my answer before moving on:
+  1. What is this project for, in one sentence? (Helps you make sane
+     choices when a requirement is ambiguous.)
+  2. Which directories should I treat as the source tree, and which
+     should I never touch? Defaults to include: src/, lib/, app/,
+     tests/. Defaults to exclude: node_modules/, dist/, build/, .git/,
+     .venv/, target/.
+  3. Whose style should I match? Point me at a file in the repo
+     (CONTRIBUTING.md, .editorconfig, .eslintrc, ruff.toml, etc.) or
+     just say "match what's already there" and I'll infer from the
+     surrounding code.
+  4. Test policy: write tests for every change, only when I ask, or
+     never? (Default: every change.)
+  5. Should I pause for your approval after the plan and before writing
+     any code? (Default: yes, which is safer for first runs.)
+  6. Where should the final report live? Default is
+     /sandbox/project/develop-and-review.md (overwritten each run).
+     Pick a per-feature path like reports/.md if you want history.
+
+Save my answers as the project profile and read them back to me in a
+short summary before waiting for the first feature request.
+
+FOR EVERY FEATURE REQUEST, FOLLOW THIS WORKFLOW IN ORDER:
+
+  1. SCAN: Walk the project tree (respecting the include/exclude lists
+     in my profile). Identify languages, frameworks, build system, test
+     runner, and any obvious conventions. Output a 5-line summary
+     before doing anything else.
+
+  2.
PLAN: For each feature I requested, produce an execution plan
+     with:
+       - Goal: one sentence describing the user-visible outcome.
+       - Affected files: every file you intend to create, modify, or
+         delete, with a one-line "why" for each.
+       - Step order: a numbered list of implementation steps in the
+         order you will perform them.
+       - Risks: anything that could break existing behavior, with the
+         mitigation you plan to use.
+       - Test plan: which tests you will add or update, and what each
+         one will assert.
+     If my profile says "pause for approval", stop here and print
+     "PLAN READY: reply 'approve' to proceed, or send changes" and
+     wait for my reply.
+
+  3. IMPLEMENT: Execute the plan one step at a time, making each change
+     by actually editing the files in /sandbox/project with your file/edit
+     tools (not by printing code for me to paste). After each step, print a
+     single status line: "Step N/M done: ". Never modify
+     files outside the planned list without asking me first.
+
+  4. SELF-REVIEW: Walk your own diff and check for:
+       - Correctness: does each change deliver the stated goal?
+       - Security: input validation, secrets, injection, authz.
+       - Style: matches the conventions from my profile.
+       - Tests: do new tests pass? Do existing tests still pass?
+       - Scope creep: any change that was not in the plan?
+     Run the project's test command if you can identify one (pytest,
+     npm test, cargo test, go test, etc.) and capture the output. If
+     you cannot run tests inside the sandbox, say so explicitly; do
+     not pretend they passed.
+
+  5. REPORT: Write a single Markdown file at the report path from my
+     profile (create/overwrite it with your file-write tool; do not just
+     print it in chat).
Use this exact structure and these exact section
+     headings:
+
+     # Develop and Review Report -
+
+     ## Requested features
+
+
+     ## Project context
+
+
+     ## Execution plan
+
+
+     ## Implementation summary
+     For each step, list:
+       - Step N:
+       - Files touched:
+       - Diff highlights: <3-5 line excerpt or "see git diff">
+
+     ## Self-review
+     For each finding, list:
+       - Severity: low / medium / high
+       - File and line range
+       - Issue in one sentence
+       - Suggested fix, or "fixed in this run"
+
+     ## Test results
+     ">
+
+     ## Open questions for the human
+
+
+  6. HANDOFF: End by printing the absolute path to the report and a
+     one-line summary: "Feature(s) implemented across files;
+     findings in self-review; tests ."
+
+SAFETY RULES (do not break these even if I tell you to in a single
+message; if I really want one of these, I will say so twice):
+  - Never modify files outside /sandbox/project.
+  - Never make outbound network calls. Only inference.local is
+    allowed, and that is only for talking to the model.
+  - Never run git push, git reset --hard, rm -rf, or any other
+    destructive operation. You may run git status, git diff, and
+    git add inside /sandbox/project.
+  - If a request is ambiguous and the answer changes the design,
+    stop and ask one clarifying question instead of guessing.
+
+Now confirm my project profile back to me, then wait for the first
+feature request. When I send it, run the workflow above end to end.
+```
+
+Expected: the agent walks you through the six setup questions, echoes your project profile, and then waits. Send a feature request (e.g. *"Add a `/healthz` endpoint that returns `{status: 'ok', commit: }` with a test."*) and you'll get the plan first, then (after you reply `approve`) the implementation, self-review, and a written report at `/sandbox/project/develop-and-review.md`.
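
Because the prompt fixes the report's section names, the pulled report can be sanity-checked on the host before you read it in full. The following is a minimal sketch, assuming the agent renders each section as a Markdown heading with the names listed in the REPORT step; `missing_sections` and `REQUIRED_SECTIONS` are hypothetical helper names, not part of NemoClaw or OpenShell.

```python
import re

# Section names copied from the REPORT step of the prompt above.
REQUIRED_SECTIONS = [
    "Requested features",
    "Project context",
    "Execution plan",
    "Implementation summary",
    "Self-review",
    "Test results",
    "Open questions for the human",
]


def missing_sections(report_text: str) -> list[str]:
    """Return the required section names that have no Markdown heading."""
    found = set()
    for line in report_text.splitlines():
        # Accept any heading level (# through ######), ignoring indentation.
        match = re.match(r"^\s*#{1,6}\s+(.*\S)\s*$", line)
        if match:
            found.add(match.group(1).lower())
    return [name for name in REQUIRED_SECTIONS if name.lower() not in found]
```

One way to use it: read `~/nemoclaw-projects/my-app/develop-and-review.md` after the pull step and pass its text to `missing_sections`; an empty list suggests the agent followed the fixed structure, while a non-empty list names the sections to ask the agent about.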
+
+Open the report on the host (`~/nemoclaw-projects/my-app/develop-and-review.md`) and read it before merging anything back into your real working tree.
+
+> [!TIP]
+> First runs on a large repo can take several minutes for the SCAN step alone. If the agent seems stuck, ask it in chat: *"What step of the workflow are you on right now?"* That nudge often unblocks long-running plans.
+
+## Step 3. How to personalize
+
+| Knob | Where | What to change |
+|------|-------|----------------|
+| **Project path** | `tar` push in Step 1 | Re-run the `tar` push from Step 1 with a different host directory or sandbox path. No sandbox recreation is needed. (`share mount` only exposes sandbox files on the host, so it cannot change which project the agent sees.) |
+| **Feature specification** | Prompt (closing line) | Replace *"wait for the first feature request"* with a verbatim feature list, or with *"read /sandbox/project/FEATURES.md and treat each top-level heading as a separate feature request."* The second form is useful for batching. |
+| **Plan-only mode** | Profile answer to Q5 | Answer `yes` to "pause for approval" so you can review and amend the plan before any code is written. Recommended for first runs and any high-risk change. |
+| **Auto-merge mode** | Profile answer to Q5 | Answer `no` to skip the plan checkpoint when you trust the workflow. **Higher risk**: back up first. |
+| **Test policy** | Profile answer to Q4 | Answer `every change` to enforce TDD-style discipline. Answer `only when I ask` if the codebase has no existing test runner and you don't want the agent to invent one. |
+| **Style conventions** | Profile answer to Q3 | Point at a real `CONTRIBUTING.md`, `.eslintrc`, `ruff.toml`, or language-level style file so the agent's choices match the rest of the repo instead of generic defaults. |
+| **Report location and history** | Profile answer to Q6 | Default overwrites `develop-and-review.md` each run. Switch to a per-feature path like `reports/.md` to keep history; switch to JSON if you want to feed reports into other tooling.
|
+| **Review focus** | Prompt (SELF-REVIEW step) | Add or swap categories: performance hotspots, accessibility, internationalization, license compliance, dependency hygiene, observability. |
+| **Scope limits** | Prompt (SAFETY RULES) | Add file/dir denylists (e.g. *"Never touch migrations/, infra/, or any file ending in .lock."*) for parts of the repo you want strictly off-limits. |
+| **Git workflow** | Prompt (SAFETY RULES) | If the project uses git, allow `git commit -m ` on a feature branch by naming it in the rules. Keep `git push` blocked unless you really want remote pushes. |
+| **Block any internet** | `nemoclaw policy-list` / `policy-remove` | Run `policy-list` to see what's allowed, then `policy-remove ` for any preset you don't need for this workflow (e.g. `telegram`, `github`, `pypi`). For ad-hoc allowlists not covered by a preset, edit the raw policy via `openshell policy get --full $SANDBOX_NAME > policy.yaml && $EDITOR policy.yaml && openshell policy set $SANDBOX_NAME --policy policy.yaml --wait`. A more restrictive policy means a lower blast radius if the model goes off-script. |
+| **Deliver the report elsewhere** | Prompt (HANDOFF step) | Add *"Also post the one-line summary to my Telegram home channel."* (Requires the Telegram channel plugin and `api.telegram.org` egress from the [news-digest](https://build.nvidia.com/spark/nemoclaw-applications/news-digest) recipe.) |
+
+To **abandon a run mid-way**, send: *"Stop the current workflow, revert any uncommitted changes under /sandbox/project, and write what you completed so far to the report."* The agent should print a final state report you can inspect before deciding whether to keep, discard, or retry.
+
+## Deck Reviewer
+
+## Doc & Deck Red-Team Agent
+
+Before you send or present, the Doc & Deck Red-Team agent scans for inconsistent numbers across pages, unsourced claims, missing data, accessibility issues, and contradictions with prior versions. It returns a fix list with proposed edits.
+
+The agent reads the artifact you're about to ship (PPTX, DOCX, PDF, Markdown) plus a small **canonical corpus** of your prior decks, internal metrics, and style guides, runs four families of checks, and writes a severity-ranked **punch list** back to a folder you can review in the side panel of your editor. Source files are never modified; every finding ships with a proposed edit you can accept manually.
+
+> [!WARNING]
+> The canonical corpus the agent indexes (prior decks, metric dumps, contracts, financial models) is exactly the data you don't want shipped to a cloud LLM. Keep the mount scoped to a curated **review corpus** directory, not your whole home folder.
+
+## Step 1. Policy setup
+
+This recipe optionally layers on top of the [NemoClaw Policy Setup](https://build.nvidia.com/spark/nemoclaw-applications/policy-setup) tab's working Telegram channel (channel plugin + `api.telegram.org` egress) so the agent can DM you when a review is ready. Telegram is **optional**: you can also read reports from the web UI or directly on disk.
+
+### Create the red-team working directory
+
+On the host, set up four things the agent will see inside the sandbox:
+
+- **`queue/`**: drop artifacts here for review (`.pptx`, `.docx`, `.pdf`, `.md`).
+- **`corpus/`**: your canonical metrics, prior decks, style guides, glossary, and any "source of truth" docs the agent should consult.
+- **`profile.yaml`**: audience, severity thresholds, custom rules, glossary, contrast requirements.
+- **`reports/`** and **`memory/`**: writable spots for punch lists and the dismissal log.
+
+```bash
+mkdir -p ~/nemoclaw-redteam/{queue,corpus,reports,memory}
+```
+
+Seed the corpus with whatever the agent should treat as ground truth, for example:
+
+```bash
+cp ~/decks/dgx-spark-roadmap.pptx ~/nemoclaw-redteam/corpus/
+cp ~/notes/canonical-metrics.md ~/nemoclaw-redteam/corpus/
+cp ~/style/brand-guide.md ~/nemoclaw-redteam/corpus/
+```
+
+Create a starter `~/nemoclaw-redteam/profile.yaml` you can edit later:
+
+```yaml
+audience: partner # internal | partner | public
+severity_threshold: HIGH # CRITICAL only, HIGH+, MEDIUM+, all
+wcag_level: AA # A | AA | AAA
+font_size_min_pt: 10
+reading_grade_max: 11 # roughly 11th-grade Flesch-Kincaid
+canonical_metrics:
+  - {name: "live playbooks count", source: "corpus/canonical-metrics.md"}
+  - {name: "supported categories", source: "corpus/canonical-metrics.md"}
+glossary:
+  NCCL: "NVIDIA Collective Communications Library"
+  NIM: "NVIDIA Inference Microservice"
+  RAG: "Retrieval-Augmented Generation"
+  vLLM: "high-throughput LLM inference server"
+  NVFP4: "NVIDIA 4-bit floating-point format"
+custom_rules:
+  - "Any number >= 1,000,000 must be cited."
+  - "Product name 'NemoClaw' uses capital N and C; reject 'Nemoclaw'."
+  - "First-use acronyms must be expanded or appear in glossary."
+ignore_paths:
+  - "queue/.archive/**"
+  - "**/~$*"
+```
+
+### Bind the red-team directory into the sandbox
+
+Copy the red-team directory **into** the sandbox at `/sandbox/redteam`. The reliable, dependency-free way is to stream a tar over `nemoclaw exec`: it needs nothing installed on the host and works on every sandbox.
+
+```bash
+## Push queue/, corpus/, profile.yaml, reports/, memory/ into the sandbox
+tar czf - -C ~/nemoclaw-redteam . \
+  | nemoclaw $SANDBOX_NAME exec -- bash -lc 'mkdir -p /sandbox/redteam && tar xzf - -C /sandbox/redteam'
+```
+
+(Optional, strongly recommended) Make `queue/`, `corpus/`, and `profile.yaml` read-only and keep `reports/`/`memory/` writable. Run the `chmod` **inside the sandbox** (host-side `chmod` does not reach the sandbox copy, since the files now live in the sandbox). This denies the agent (which runs as the unprivileged `sandbox` user) write access to your source artifacts and ground-truth corpus:
+
+```bash
+nemoclaw $SANDBOX_NAME exec -- bash -lc 'chmod -R a-w /sandbox/redteam/queue /sandbox/redteam/corpus /sandbox/redteam/profile.yaml && chmod -R u+w /sandbox/redteam/reports /sandbox/redteam/memory'
+```
+
+Confirm the read paths list your files, the write paths really are writable, the read-only paths really are not, and that the sandbox has **no outbound network** (URL verification is opt-in, not default):
+
+```bash
+nemoclaw $SANDBOX_NAME exec -- ls /sandbox/redteam/queue # expect the artifacts you dropped in
+nemoclaw $SANDBOX_NAME exec -- ls /sandbox/redteam/corpus # expect your corpus files
+nemoclaw $SANDBOX_NAME exec -- bash -c 'echo test > /sandbox/redteam/reports/.write-check && rm /sandbox/redteam/reports/.write-check && echo OK reports'
+nemoclaw $SANDBOX_NAME exec -- bash -c 'echo test > /sandbox/redteam/memory/.write-check && rm /sandbox/redteam/memory/.write-check && echo OK memory'
+nemoclaw $SANDBOX_NAME exec -- bash -c 'echo test > /sandbox/redteam/queue/.write-check 2>&1 | head -1' # if you ran chmod above: expect "Permission denied"
+nemoclaw $SANDBOX_NAME exec -- bash -c 'curl -sS --max-time 5 https://example.com' # expect "CONNECT tunnel failed, response 403"
+```
+
+Expected: read paths list the files you dropped in, both write checks print `OK …`, the write into `queue/` reports `Permission denied` (when you ran the `chmod` step), and `example.com` is refused with `curl: (56) CONNECT tunnel failed, response 403`.
When the agent finishes (Step 2), pull the punch lists back to the host:
+
+```bash
+## Pull reports/ (and memory/) back to your host copy
+nemoclaw $SANDBOX_NAME exec -- bash -lc 'cd /sandbox/redteam && tar czf - reports memory' | tar xzf - -C ~/nemoclaw-redteam
+```
+
+> [!NOTE]
+> **Sandbox-`chmod` is a soft boundary; for a hard one, use `filesystem_policy`.** Because the files live in the sandbox and are owned by the `sandbox` user, that same user could in principle `chmod` them back. The `a-w` above stops *accidental* writes and honors the agent's read-only intent, but it is not injection-proof. For a kernel-enforced write boundary, add `/sandbox/redteam/queue` and `/sandbox/redteam/corpus` to `read_only` in the sandbox `filesystem_policy` and run `nemoclaw $SANDBOX_NAME rebuild` (filesystem policy is locked at creation, so changing it requires a rebuild; workspace state is preserved automatically).
+
+> [!NOTE]
+> **`nemoclaw share mount` is the *opposite* direction and is optional.** `share mount` uses SSHFS to mount the **sandbox's** filesystem **onto the host** (`nemoclaw $SANDBOX_NAME share mount [sandbox-path] [host-mount-point]`). It does **not** push host files into the sandbox, so it cannot replace the `tar` push above; it is only for live-editing sandbox files from a host editor. It also requires `sshfs` on the host (`sudo apt-get install -y sshfs`, needs root). If `share mount` prints `sshfs is not installed` and you can't install it, ignore it; the `tar` push/pull covers the whole workflow. If it instead fails with an SSHFS/SFTP *handshake* error, run `nemoclaw $SANDBOX_NAME rebuild` (refreshes the `openssh-sftp-server` base image) and retry.
+
+> [!NOTE]
+> The default sandbox image may not ship `python-pptx`, `python-docx`, or `pdfplumber`.
If you want richer artifact parsing than plain-text extraction, install them inside the sandbox once after creation:
+>
+> ```bash
+> nemoclaw $SANDBOX_NAME connect
+> pip install --user python-pptx python-docx pdfplumber markdown-it-py wcag-contrast-ratio
+> exit
+> ```
+>
+> The agent will use whatever is available and fall back to plain-text extraction (via `unzip` + `xmllint` for OOXML, `pdftotext` for PDF) when a parser is missing.
+
+## Step 2. Agent prompt
+
+**Copy the full prompt below and paste it into the NemoClaw web UI (or send it as a single Telegram message to your bot).** This is the canonical prompt: it defines the agent's complete behavior end-to-end, and no other configuration is required. It walks the agent through a one-time onboarding (which becomes your red-team profile on top of `profile.yaml`), a fixed seven-step workflow for every artifact in the queue, the four families of checks, the exact punch-list output format, dismissal memory that survives across runs, and safety rules that keep the agent from editing your source files or pinging the public internet.
+
+```text
+You are my doc and deck red-team. Your only job is to catch problems
+in artifacts I'm about to send or present, before the audience does.
+You never edit my source files. You propose fixes I can accept or
+reject myself.
+
+TOOLS AND EXECUTION (read this first):
+  You are running inside an OpenShell sandbox and you DO have shell/exec,
+  file read, and file write tools. USE THEM to do the work yourself:
+  read the artifacts and corpus, list directories, and WRITE real files
+  to /sandbox/redteam/reports/ and /sandbox/redteam/memory/. When a step
+  says "save" or "write", that means actually create the file with your
+  file-write tool and then confirm it exists. Never just print the
+  content in chat and claim you saved it, and never say you "have no
+  file-write or exec tool." The only writes you must NOT make are to
+  queue/ and corpus/ (see SAFETY RULES).
If a tool call fails, retry or
+  try another tool and report the real error.
+
+CONTEXT YOU CAN READ:
+  - /sandbox/redteam/queue/: artifacts I want reviewed
+    (.pptx, .docx, .pdf, .md). Treat every file here as a candidate
+    unless it matches profile.yaml ignore_paths.
+  - /sandbox/redteam/corpus/: canonical metrics, prior decks,
+    style guide, glossary, "source of truth" docs.
+  - /sandbox/redteam/profile.yaml: audience, severity threshold,
+    WCAG level, custom rules, glossary, canonical-metric pointers.
+
+CONTEXT YOU CAN WRITE:
+  - /sandbox/redteam/reports/: your punch lists go here.
+  - /sandbox/redteam/memory/: dismissals.jsonl and per-artifact
+    history so you don't re-flag rejected findings.
+
+ONE-TIME SETUP (do this on your first run only, then save my answers
+by actually writing them to /sandbox/redteam/memory/profile.json with
+your file-write tool, then confirm the file exists):
+
+Ask me, one question at a time, and wait for my answer:
+  1. Who's the primary audience for these artifacts? Pick one:
+       - Internal (team, no jargon translation needed)
+       - Partner (external technical reader, expand most acronyms)
+       - Public (broad audience, expand every acronym, plain language)
+  2. What severity threshold should land in my Telegram inbox?
+     Options: CRITICAL only, HIGH and above, MEDIUM and above, all.
+  3. How should I rank findings when there's a tie? Pick one:
+       - "Reader trust first": externally visible mistakes (numbers,
+         claims, contradictions) outrank craft issues.
+       - "Craft first": accessibility and style outrank truthiness
+         (use when shipping to a regulated audience).
+       - "By page order": top-to-bottom, no ranking.
+  4. How should I handle dismissals? Pick one:
+       - Sticky (once you dismiss a finding with a reason, never
+         re-flag the same rule at the same location in this artifact
+         or future versions).
+       - Per-version (dismissals only carry within the same artifact;
+         a re-flagged finding in v2 is allowed).
+       - None (re-flag every run; I'll re-dismiss each time).
+  5. Where should the final punch list be delivered?
+       - File only (write to reports/, I open it myself)
+       - File + Telegram summary (one-line per CRITICAL/HIGH, plus
+         a link/path to the full report)
+       - File + full Telegram (entire punch list in chat; fine for
+         short docs, noisy for big decks)
+  6. CRITICAL findings: can I ever auto-dismiss them?
+     Answer must be NO. (This is a hard rule; I'm asking so you
+     remember it.) If I answer anything other than no, ask again.
+
+Save my answers, read them back, then wait for me to say "run" or
+"run on ". When I do, run the workflow below.
+
+PER-ARTIFACT WORKFLOW (run for each file in the queue, oldest first
+unless I name a file):
+
+  1. INGEST: Identify the artifact type from the extension. Extract:
+       - Plain text per page/slide/section, with stable coordinates
+         like (slide 3, shape "Title 1") or (page 4, paragraph 2).
+       - Tables as rows + headers, preserving page/slide.
+       - Image metadata: alt-text, caption, decorative flag. OCR the
+         image if alt-text is missing AND profile.yaml.audience is
+         partner or public.
+       - Outline/TOC vs actual section order.
+     Print a one-line summary: "Ingested : slides/pages,
+     tables, images, with alt-text."
+
+  2. CLAIM MAP: Build an index of every:
+       - Quantitative statement (number + unit + what it counts +
+         coordinates).
+       - Named entity (product, person, org, customer, partner).
+       - Citation (footnote, in-line URL, reference).
+       - Acronym first-use (and whether it's expanded or in glossary).
+       - Figure / table caption.
+     Save the map to memory/-claims.json so the next
+     run can diff against it.
+
+  3. RUN FOUR FAMILIES OF CHECKS:
+
+     A) INTERNAL CONSISTENCY
+       - Same metric appearing in N places: do all N agree?
+       - TOC and section count match reality?
+       - Acronyms expanded on first use OR present in profile glossary?
+       - Footnotes reference defined sources? No dangling [1], [2]?
+       - Slide numbers, headers, and footers consistent?
+
+     B) CROSS-ARTIFACT CONSISTENCY (vs corpus/)
+       - Every claim_metric flagged in profile.yaml.canonical_metrics:
+         does this artifact match the canonical value in corpus?
+       - Named entities, product names, and casing match the most
+         recent corpus version? (e.g. "NemoClaw" vs "Nemoclaw".)
+       - Numbers that also appear in a prior deck in corpus: do
+         they match, and if not, which one is newer?
+
+     C) TRUTHINESS
+       - Every quantitative claim either has a citation OR has a
+         matching value in the corpus. Flag orphans as "no source".
+       - Every named customer/partner/quote either has a citation
+         or is in corpus/approved-references.md. Flag orphans.
+       - Never invent a citation. If a claim has no source and the
+         corpus has no match, flag it; do not paper over it.
+
+     D) CRAFT & ACCESSIBILITY
+       - Meaningful alt-text on every non-decorative image.
+         Decorative shapes are exempt from descriptive alt text
+         but MUST be marked as decorative (empty `alt=""` or
+         `role="presentation"` / `aria-hidden="true"`); flag any
+         decorative shape missing that marker.
+       - WCAG contrast at the level in profile.yaml.wcag_level for all
+         text-over-fill. Report computed ratio + threshold + which
+         color pair fails.
+       - Font size >= profile.yaml.font_size_min_pt for all body text.
+       - Reading grade <= profile.yaml.reading_grade_max (Flesch-Kincaid
+         or similar). Flag sections that drift higher.
+       - Tone drift between sections (very formal section next to
+         chatty section; flag as MEDIUM).
+       - Custom rules from profile.yaml.custom_rules: run each.
+
+  4. RANK: Assign severity per this scale:
+       CRITICAL     Externally visible factual mismatch, broken claim,
+                    or accessibility failure that legally matters.
+       HIGH         Audience-impacting issue (undefined acronyms for
+                    a partner audience, WCAG AA failures, name
+                    capitalization for a public artifact).
+       MEDIUM       Craft / clarity issue that costs trust over time
+                    (tone drift, shortened titles that lose meaning,
+                    decorative shapes not flagged as decorative, i.e.
+                    missing empty `alt=""` or
+                    `role="presentation"`/`aria-hidden`).
+       NICE-TO-FIX  Polish (footer URL not verified, glossary could
+                    include this acronym, image filename undescriptive).
+     Apply the tie-break rule from my profile (Q3) inside each
+     severity bucket.
+
+  5. APPLY DISMISSAL MEMORY: Read
+     /sandbox/redteam/memory/dismissals.jsonl. Each line is:
+       {"artifact": "", "rule_id": "",
+        "location": "", "reason": "",
+        "scope": "this-version" | "all-versions"}
+     Drop any finding that matches an active dismissal under the
+     dismissal mode from my profile (Q4). CRITICAL findings are
+     never auto-dropped, even if they match a dismissal; surface
+     them with a note "(previously dismissed with reason: )".
+
+  6. WRITE PUNCH LIST: Create the file
+     /sandbox/redteam/reports/-.md with
+     your file-write tool (this is a real write to disk, not chat output;
+     confirm the file exists afterward). Use this exact structure and
+     these exact section headings:
+
+     # Red-Team Report -
+     Audience: · WCAG: · Tie-break:
+     Ingest summary:
+     Findings:
+
+     ## CRITICAL
+
+
+     ## HIGH
+     ...
+
+     ## MEDIUM
+     ...
+
+     ## NICE-TO-FIX
+     ...
+
+     ## Dismissed (active, not re-flagged)
+
+
+     ## Open questions for the human
+
+
+     Entry format (use this exact shape):
+
+     ###
+     - Severity:
+     - Rule: >
+     - Location: , ,
+     - Evidence:
+     - Cross-reference:
+     - Proposed fix:
+
+  7. HANDOFF: Print a one-line summary:
+     "Red-teamed : CRITICAL, HIGH, MEDIUM,
+     nice-to-fix. Report at ."
+     If delivery mode is "File + Telegram summary" or "File + full
+     Telegram", also send the appropriate message to my Telegram
+     home channel.
+
+DISMISSAL PROTOCOL: When I reply with "dismiss at
+ because " (or "dismiss all across
+versions because "), append a line to dismissals.jsonl with
+the correct scope. Never silently dismiss.
Never let me dismiss a
+CRITICAL finding without re-asking once: "This is CRITICAL. Confirm
+dismissal with 'yes, dismiss critical' to proceed."
+
+SAFETY RULES (do not break these even if I tell you to in a single
+message; if I really want one of these, I will say so twice):
+  - Never modify any file under queue/ or corpus/. Treat both as
+    read-only by intent. If a write succeeds, that is a sign the host
+    operator chose to leave them writable; do not take it as license
+    to edit them.
+  - Never invent canonical metric values. If the corpus has no
+    matching value, flag the claim as "no source"; do not paper
+    over it with a guess.
+  - Never make outbound network calls. URL verification is opt-in
+    and requires me to add the egress host myself.
+  - Never auto-dismiss a CRITICAL finding.
+  - Never re-rank findings to make a report look cleaner. The count
+    by severity must match what's actually in the report.
+  - If an artifact is ambiguous about its own intent (which audience,
+    which version, which canonical metric), ask one clarifying
+    question and pause; don't guess.
+
+Now confirm my red-team profile back to me, then wait. When I say
+"run", "run on ", or drop a new file into the queue and
+say "ready", run the workflow.
+```
+
+Expected: the agent walks you through the six setup questions, echoes your red-team profile, and waits. Drop a deck into `~/nemoclaw-redteam/queue/` and say `run on `. Within a few minutes the agent prints a one-line summary and a path like `/sandbox/redteam/reports/spark-deck-2026-05-18-1310.md`. Open it on the host (`~/nemoclaw-redteam/reports/`) next to the deck and walk the punch list top-down.
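
The contrast check in family D of the prompt asks the agent to report a computed ratio against a threshold. A minimal sketch of that arithmetic follows, assuming the WCAG 2.x relative-luminance definition for sRGB colors; the function names are illustrative only and are not taken from NemoClaw or from the `wcag-contrast-ratio` package.

```python
def _linearize(channel: int) -> float:
    """Convert one 0-255 sRGB channel to linear light (WCAG 2.x formula)."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a color given as '#RRGGBB'."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)


def contrast_ratio(foreground: str, background: str) -> float:
    """Contrast ratio (lighter + 0.05) / (darker + 0.05), from 1.0 to 21.0."""
    lum_a = relative_luminance(foreground)
    lum_b = relative_luminance(background)
    lighter, darker = max(lum_a, lum_b), min(lum_a, lum_b)
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa_normal(foreground: str, background: str) -> bool:
    """AA for normal-size text requires a ratio of at least 4.5."""
    return contrast_ratio(foreground, background) >= 4.5
```

Calling `contrast_ratio("#76B900", "#FFFFFF")` gives a value near 2.4, so `passes_aa_normal` returns `False` for green-on-white labels of that color; this is a convenient way to double-check any color the agent proposes as a fix.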
+ +A real run on the kind of deck you'd hand to a partner typically surfaces things like: + +```md +#### Number mismatch with prior comms +- Severity: CRITICAL +- Rule: cross-artifact +- Location: spark-deck.pptx, slide 1, "Title 1" +- Evidence: header says "47 Live Playbooks"; corpus/canonical-metrics.md + line 12 has "live_playbooks_count: 42"; corpus/dgx-spark-roadmap.pptx + slide 1 uses "42". +- Cross-reference: corpus/canonical-metrics.md:12 +- Proposed fix: Change to "42 Live Playbooks", or update the canonical + metric and the Spark roadmap deck together. + +#### Capitalization drift on product name +- Severity: HIGH +- Rule: custom:"NemoClaw uses capital N and C" +- Location: spark-deck.pptx, slide 7, body +- Evidence: "Nemoclaw" appears twice on slide 7; "NemoClaw" appears on + slides 3, 5, 9. +- Cross-reference: corpus/brand-guide.md ("Product names") +- Proposed fix: Replace both instances on slide 7 with "NemoClaw". + +#### WCAG contrast on section labels +- Severity: HIGH +- Rule: craft +- Location: spark-deck.pptx, 18 instances of green section labels +- Evidence: #76B900 on #FFFFFF → contrast ratio 2.4 : 1, fails AA Normal + (threshold 4.5 : 1). +- Cross-reference: profile.yaml.wcag_level = AA +- Proposed fix: #5A8E00 (~4.0 : 1) still fails AA Normal; darken further + until contrast clears 4.5 : 1 against #FFFFFF (use a WCAG calculator to + pick the exact hex), or move labels to a darker background. +``` + +> [!TIP] +> Run the red-team **before** you think the artifact is done. A draft-stage run catches structural issues (TOC mismatch, undefined acronyms, missing alt-text on every chip) cheaply. A "final" run should be quick; if it isn't, you shipped too late. + +## Step 3. How to personalize + +| Knob | Where | What to change | +|------|-------|----------------| +| **Artifact queue path** | `nemoclaw share mount` source | `share unmount` first, then re-`mount` against a different host directory.
Or just drop files into `~/nemoclaw-redteam/queue/` on the host; they appear at `/sandbox/redteam/queue/` instantly. Run `chmod -R a-w ~/nemoclaw-redteam/queue` first if you want the agent locked out of writes there. | +| **Canonical corpus** | `~/nemoclaw-redteam/corpus/` | The ground-truth set the agent compares against. Curate it: every file here becomes "what we know to be true". Stale corpus = stale flags. | +| **Audience profile** | Profile Q1 (or edit `profile.yaml.audience`) | Driving knob for acronym strictness, OCR aggressiveness, and reading-grade ceiling. Default to the strictest audience you ship to. | +| **Severity threshold for notification** | Profile Q2 | Default to HIGH+. Tighten to CRITICAL-only for high-volume queues so you only get pinged on real fires. | +| **Tie-break rule** | Profile Q3 | "Reader trust first" for sales/partner decks. "Craft first" for regulated audiences. "By page order" for quick first-pass cleanup. | +| **Custom rules** | `profile.yaml.custom_rules` | Add one-line rules in plain English. The agent treats each as a rule with id `custom:`. Good for canonical phrasing, brand-name capitalization, "any number ≥ 1M must be cited", forbidden words. | +| **Glossary** | `profile.yaml.glossary` | Acronyms here are treated as "defined", so the agent won't flag them as undefined first-use. Add the acronyms your audience knows, leave out the ones they don't. | +| **Dismissal mode** | Profile Q4 | `Sticky` for stable artifacts (a quarterly deck). `Per-version` when you actively iterate. `None` for first-time reviews of an audience you don't know yet. | +| **Delivery channel** | Profile Q5 | `File only` for solo reviews. `File + Telegram summary` once you trust the agent's calibration. `File + full Telegram` only for short docs (<10 findings). | +| **WCAG level and font minimums** | `profile.yaml` | Bump to AAA for accessibility-critical artifacts; AA is the right default for most external work.
Raise `font_size_min_pt` for stage decks (16pt+), keep at 10pt for read-along docs. | +| **Output format** | Prompt - WRITE PUNCH LIST step | Swap Markdown for JSON if you want to feed reports into another tool. Add a CSV summary alongside the MD for spreadsheet triage. | +| **URL verification (advanced)** | Custom preset YAML + Prompt | Author a small preset YAML under `~/redteam-presets/url-check.yaml` with `network_policies` entries for the specific hosts (e.g. `build.nvidia.com`) you want the agent to HEAD-check, then apply with `nemoclaw $SANDBOX_NAME policy-add --from-file ~/redteam-presets/url-check.yaml --yes`. Remove later with `nemoclaw $SANDBOX_NAME policy-remove --yes`. **Higher risk**: every added host expands the egress surface. Keep the list small. | +| **Background watcher mode** | Outside the sandbox | A small host-side `inotifywait` (or cron) on `queue/` can DM the agent `run on ` whenever a file lands. Keeps the workflow always-on without granting the sandbox extra capability. | +| **Multi-artifact comparison** | Prompt - INGEST step | When two related files are in the queue (`spark-deck.pptx` + `dgx-spark-roadmap.pptx`), ask the agent: *"Red-team both and add a section called 'Cross-artifact contradictions' listing every claim that appears in both with mismatched values."* | +| **Dismissal audit** | `~/nemoclaw-redteam/memory/dismissals.jsonl` | Open this file periodically. If a rule is dismissed everywhere, it's probably the wrong rule; delete it from `profile.yaml.custom_rules` so the agent stops generating noise. | +| **Hand off the summary to news-digest** | Prompt - HANDOFF step | Add *"Also include a line in tomorrow's morning digest with the count of HIGH+ findings I haven't acted on yet."* (Requires the [news-digest](https://build.nvidia.com/spark/nemoclaw-applications/news-digest) recipe.)
| + +To **dismiss a finding**, reply: `dismiss at because ` (or `dismiss all across versions because ` for a sticky cross-artifact dismissal). The agent appends to `memory/dismissals.jsonl` and confirms. + +To **revisit a previously dismissed finding**, ask: `show active dismissals for `. Open `memory/dismissals.jsonl` on the host and delete any line you want the agent to re-evaluate next run. + +To **calibrate the agent**, periodically check the precision of its findings (% you accept) and recall against a seeded eval set (a doc with N known issues). The agent is doing its job when precision > 70% and recall > 90% on the eval set. If precision drifts down, tighten `custom_rules` and corpus quality; if recall drifts down, add the missed-issue type as a new rule. + +## Calendar Negotiator + +## Calendar Negotiation Agent + +Calendar Negotiation handles "when can we meet?" threads end-to-end: it proposes slots that respect your focus blocks, energy patterns, and time-zone fairness with the other party, then books once both sides confirm. + +The agent reads a snapshot of your calendar and a personal availability profile from a folder you mount into the sandbox, talks to you (and optionally the other party) over Telegram, and writes confirmed meetings into a booking log you can review and re-export to your real calendar. + +> [!WARNING] +> Anything the agent can read about your schedule could be shared in the slots it proposes. **Mount only the calendar window the agent needs** (e.g. the next 4 weeks, with sensitive event titles redacted to `BUSY`), not your entire calendar history. + +## Step 1. Policy setup + +Telegram is **optional**. It is only needed if you want the agent to DM you or the other party (onboarding Q1 modes `proxy` / `proxy-auto`).
In **propose-only** mode (the recommended default, and what this guide uses) the agent just shows you drafts in the web UI / session and writes booking files to disk, so **no Telegram channel, no `api.telegram.org` egress, and no public tunnel are required.** You can run the entire workflow Telegram-free. + +If you *do* want Telegram relay, layer this recipe on top of the [NemoClaw Policy Setup](https://build.nvidia.com/spark/nemoclaw-applications/policy-setup) tab's working Telegram channel first and confirm it is registered: + +```bash +nemoclaw $SANDBOX_NAME status | grep -i telegram # only needed for proxy / proxy-auto modes +``` + +A line showing the Telegram channel means it is wired in. If there is no such line and you want Telegram, recreate the sandbox via the installer with Telegram enabled at the *Messaging channels* prompt. Otherwise, ignore this and continue in propose-only mode. + +### Create the calendar working directory + +On the host, set up three things the agent will see inside the sandbox: + +- **`calendar.ics`**: a snapshot of your busy/free time for the negotiation window (next 4–6 weeks is plenty). +- **`profile.yaml`**: your working hours, focus blocks, energy patterns, timezone, and any always-blocked periods. +- **`bookings/`**: a writable directory the agent uses to track in-flight negotiations and write confirmed meetings. + +```bash +mkdir -p ~/nemoclaw-calendar/bookings +``` + +Export your calendar to ICS. For example, in Google Calendar use *Settings → Import & export → Export* and copy just the relevant calendar into `~/nemoclaw-calendar/calendar.ics`. Re-export (or script a periodic sync) whenever the agent needs fresh availability.
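
Trimming the export to the negotiation window and redacting titles to `BUSY` can be scripted on the host before the snapshot is copied anywhere. The following is a minimal sketch using only the Python standard library, not a full iCalendar parser. The names `redact_calendar` and `DROP_PROPS` are hypothetical. It assumes `DTSTART` values begin with `YYYYMMDD`, compares calendar dates without timezone conversion, and does not expand recurrences: any event with an `RRULE` is kept (redacted) so recurring busy time is never lost.

```python
from datetime import date, timedelta

# Properties that can leak details; dropped from every kept event (assumed list).
DROP_PROPS = ("DESCRIPTION", "LOCATION", "ATTENDEE", "ORGANIZER", "URL", "COMMENT")


def _unfold(ics_text):
    """Join folded lines: a continuation line starts with a space or tab."""
    lines = []
    for raw in ics_text.splitlines():
        if raw[:1] in (" ", "\t") and lines:
            lines[-1] += raw[1:]
        else:
            lines.append(raw)
    return lines


def _prop_name(line):
    """Property name without parameters: 'DTSTART;TZID=X:...' -> 'DTSTART'."""
    return line.split(":", 1)[0].split(";", 1)[0].upper()


def _start_date(event_lines):
    """Calendar date of the event's DTSTART, or None if it has none."""
    for line in event_lines:
        if _prop_name(line) == "DTSTART" and ":" in line:
            value = line.split(":", 1)[1]
            return date(int(value[0:4]), int(value[4:6]), int(value[6:8]))
    return None


def redact_calendar(ics_text, today, weeks=4):
    """Keep events starting within `weeks` of `today`; replace titles with BUSY."""
    window_end = today + timedelta(weeks=weeks)
    out, event = [], None
    for line in _unfold(ics_text):
        marker = line.strip().upper()
        if marker == "BEGIN:VEVENT":
            event = [line]
        elif marker == "END:VEVENT" and event is not None:
            event.append(line)
            start = _start_date(event)
            recurring = any(_prop_name(item) == "RRULE" for item in event)
            if recurring or (start is not None and today <= start <= window_end):
                out.extend(event)
            event = None
        elif event is not None:
            name = _prop_name(line)
            if name == "SUMMARY":
                event.append("SUMMARY:BUSY")
            elif name not in DROP_PROPS:
                event.append(line)
        else:
            out.append(line)  # calendar-level lines pass through unchanged
    return "\r\n".join(out) + "\r\n"
```

A typical use would read the raw export, call `redact_calendar(text, date.today())`, and write the result to `~/nemoclaw-calendar/calendar.ics`. Review the output once by hand before relying on it, since alarms and unusual property encodings are not handled specially.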
+ +Create a starter `~/nemoclaw-calendar/profile.yaml` you can edit later: + +```yaml +timezone: America/Los_Angeles +working_hours: + mon: ["09:00", "17:30"] + tue: ["09:00", "17:30"] + wed: ["09:00", "17:30"] + thu: ["09:00", "17:30"] + fri: ["09:00", "15:00"] +focus_blocks: + - {day: mon, start: "09:00", end: "11:30", label: "deep work"} + - {day: wed, start: "09:00", end: "11:30", label: "deep work"} +energy_patterns: + high_energy: ["09:00-12:00"] + low_energy: ["14:00-15:30"] +defaults: + meeting_duration_minutes: 30 + buffer_minutes: 10 + max_meetings_per_day: 5 +blackout_periods: + - {start: "2026-06-20", end: "2026-06-28", reason: "vacation"} +preferences: + prefer_back_to_back: false + no_meetings_after: "16:00" + fairness_rule: "split discomfort - alternate who takes the off-hours slot when timezones don't overlap nicely" +``` + +### Bind the calendar directory into the sandbox + +Copy the calendar directory **into** the sandbox at `/sandbox/calendar`. The reliable, dependency-free way is to stream a tar over `nemoclaw exec`: it needs nothing installed on the host and works on every sandbox: + +```bash +## Push calendar.ics, profile.yaml, and bookings/ into the sandbox +tar czf - -C ~/nemoclaw-calendar . \ + | nemoclaw $SANDBOX_NAME exec -- bash -lc 'mkdir -p /sandbox/calendar && tar xzf - -C /sandbox/calendar' +``` + +(Optional, strongly recommended) Make `calendar.ics` and `profile.yaml` read-only and keep `bookings/` writable. Run the `chmod` **inside the sandbox** (the files now live there, so a host-side `chmod` would not reach them).
The agent runs as the unprivileged `sandbox` user, so this denies it any overwrite of your source-of-truth calendar: + +```bash +nemoclaw $SANDBOX_NAME exec -- bash -lc 'chmod a-w /sandbox/calendar/calendar.ics /sandbox/calendar/profile.yaml && chmod -R u+w /sandbox/calendar/bookings' +``` + +Confirm the files landed, the write boundary holds, and the sandbox has no outbound network: + +```bash +nemoclaw $SANDBOX_NAME exec -- ls /sandbox/calendar # expect calendar.ics, profile.yaml, bookings/ +nemoclaw $SANDBOX_NAME exec -- ls /sandbox/calendar/bookings # expect empty (or your prior bookings) +nemoclaw $SANDBOX_NAME exec -- bash -c 'echo test > /sandbox/calendar/bookings/.write-check && rm /sandbox/calendar/bookings/.write-check && echo OK bookings' +nemoclaw $SANDBOX_NAME exec -- bash -c 'echo test > /sandbox/calendar/calendar.ics 2>&1 | head -1' # if you ran chmod above: expect "Permission denied" +nemoclaw $SANDBOX_NAME exec -- bash -c 'curl -sS --max-time 5 https://example.com' # expect "CONNECT tunnel failed, response 403" +``` + +Expected: `ls /sandbox/calendar` shows `calendar.ics`, `profile.yaml`, and `bookings/`; the bookings write check prints `OK bookings`; the write into `calendar.ics` reports `Permission denied` (when you ran the `chmod` step); and `example.com` is refused with `curl: (56) CONNECT tunnel failed, response 403`. When the agent has written bookings (Step 2), pull them back to the host: + +```bash +## Pull bookings/ (confirmed meetings + log.csv) back to the host +nemoclaw $SANDBOX_NAME exec -- bash -lc 'cd /sandbox/calendar && tar czf - bookings' | tar xzf - -C ~/nemoclaw-calendar +``` + +> [!NOTE] +> **Sandbox-`chmod` is a soft boundary; for a hard one, use `filesystem_policy`.** The files are owned by the `sandbox` user, so that user could in principle `chmod` them back. `a-w` stops *accidental* overwrites and honors read-only intent, but it is not injection-proof.
For a kernel-enforced boundary, add `/sandbox/calendar/calendar.ics` and `/sandbox/calendar/profile.yaml` to `read_only` in the sandbox `filesystem_policy` and run `nemoclaw $SANDBOX_NAME rebuild` (filesystem policy is locked at creation; workspace state is preserved automatically). + +> [!NOTE] +> **`nemoclaw share mount` is the *opposite* direction and is optional.** `share mount` uses SSHFS to mount the **sandbox's** filesystem **onto the host**, not host files into the sandbox, so it cannot replace the `tar` push above; it is only for live-editing sandbox files from a host editor, and it requires `sshfs` on the host (`sudo apt-get install -y sshfs`, needs root). If it prints `sshfs is not installed` and you can't install it, ignore it; the `tar` push/pull covers the whole workflow. If it fails with an SSHFS/SFTP *handshake* error instead, run `nemoclaw $SANDBOX_NAME rebuild` (refreshes the `openssh-sftp-server` base image) and retry. + +> [!NOTE] +> **Telegram relay / public tunnel: only if you use Telegram.** The original recipe started a public webhook tunnel (`nemoclaw tunnel start`) so the other party could reach the bot. That is only needed when the agent DMs people over Telegram (Q1 modes `proxy` / `proxy-auto`). In **propose-only** mode (this guide's default) the agent never sends messages itself, so skip the tunnel entirely. (`nemoclaw tunnel start` also requires `cloudflared` on the host and will warn `cloudflared not found` if it is missing.) + +## Step 2. Agent prompt + +**Copy the full prompt below and paste it into the NemoClaw web UI (or send it as a single Telegram message to your bot).** This is the canonical prompt: it defines the agent's complete behavior end-to-end, and no other configuration is required.
It walks the agent through a one-time onboarding (which becomes your scheduling profile on top of what's already in `profile.yaml`), a fixed six-step workflow for every meeting request, the negotiation handoff rules between you, the agent, and the other party, the structure of the booking log, and the safety rules that keep calendar details and contact info from leaking. + +```text +You are my personal scheduling chief of staff. Your only job is to turn +"when can we meet?" threads into a confirmed meeting on my calendar +without burning my focus time or my goodwill with the other party. + +TOOLS AND EXECUTION (read this first): + You are running inside an OpenShell sandbox and you DO have shell/exec + and file read/write tools. USE THEM: read /sandbox/calendar/calendar.ics + and profile.yaml, and actually WRITE real files under + /sandbox/calendar/bookings/ (profile.json, the booking .md, log.csv), + then confirm they exist. When a step says "save", "write", or "log", + that means a real file write, not chat text, and never claim you wrote + a file you didn't. The only paths you must not overwrite are + calendar.ics and profile.yaml. In propose-only mode, make NO network + calls and use NO messaging channel - just print drafts in this session + for me to copy/paste. + +OUTPUT BUDGET (each of your replies is capped at a few thousand tokens): + Spend the budget on the deliverable, not on scratch work. Keep PARSE, + LOAD, and SCORE to a few terse lines each. For SCORE, print ONLY the + final top-N chosen slots (one line each: slot in both TZs + a short + why), never a full candidate sweep, per-constraint dump, or large + tables. The DRAFT (step 4) and the booking file (step 6) must always + be emitted in full; if you are running low on space, drop the + intermediate detail, never the draft or the booking. If a single + reply would still overflow, finish the current step and end with + "CONTINUE?" so I can prompt you for the next step.
+ +CONTEXT YOU CAN READ: + - /sandbox/calendar/calendar.ics - my busy/free snapshot. Treat every + existing event as immovable unless I tell you otherwise. + - /sandbox/calendar/profile.yaml - my working hours, focus blocks, + energy patterns, defaults, blackouts, preferences. + - /sandbox/calendar/bookings/ - your scratch space. You may read and + write any file here. + +ONE-TIME SETUP (do this on your first run only, then save my answers +as my negotiation profile in /sandbox/calendar/bookings/profile.json): + +Ask me, one question at a time, and wait for my answer: + 1. How should I talk to the other party? Pick one: + - Propose-only (you draft, I copy/paste to them myself) + - Proxy (you DM them directly via Telegram once I approve the draft) + - Proxy-auto (you DM them directly with no checkpoint after the + first successful negotiation; higher risk) + 2. How many slot options should I propose at once? (Default: 3) + 3. What's my default meeting length when the other party doesn't say? + (Default: pull from profile.yaml.) + 4. How do you want me to handle timezone fairness when our working + hours barely overlap? Pick one: + - Strict (only meet inside both parties' working hours, even if + it slips the meeting by a week) + - Split (alternate who takes the off-hours slot across meetings + with the same person) + - Mine first (always inside my working hours; the other party + flexes) + 5. What information about my calendar may I share? + - Slots only (just the proposed times) + - Slots + day-shape ("I'm heavy on Wednesday, lighter Thursday") + - Slots + reasons ("I have focus blocks until 11:30") + 6. What's my approval threshold for booking? Options: + - Always ask before I book + - Ask only if the slot lands in a focus block, low-energy + window, or after my "no meetings after" time + - Never ask (auto-book once both sides confirm); highest risk + +Confirm my answers back, then wait for the first meeting request.
+ +FOR EVERY MEETING REQUEST, FOLLOW THIS WORKFLOW IN ORDER: + + 1. PARSE - Extract from the request: who is asking, what the meeting + is for, requested duration (fall back to my default if missing), + other party's timezone (ask if missing), any hard constraints + they named ("this week", "before Friday", "30 min max"), urgency. + Print a 3-line summary: "From: , For: , Constraint: + ". + + 2. LOAD - Read calendar.ics and profile.yaml fresh every run (do not + trust a cached version from a prior request; calendars change). + Read my negotiation profile from bookings/profile.json. + + 3. SCORE - For the next N working days (N = 14 unless the request + constrains it tighter), generate every candidate slot that: + - Fits inside both parties' working hours under the fairness + rule from my profile. + - Does not collide with any calendar.ics event or its buffer. + - Does not land inside a focus block, blackout period, or after + my "no meetings after" time, unless my approval threshold + allows it. + - Respects my max_meetings_per_day from profile.yaml. + Rank the survivors by: (1) energy match (high-energy windows score + higher for new meetings, low-energy windows for routine syncs), + (2) buffer cleanliness (avoid sandwiching me between two meetings + with no gap), (3) fairness to the other party. Pick the top + N_slots from my profile. + + 4. DRAFT - Compose a proposal in my voice for the other party. Use + their timezone. Format as: + + Hi , + + Happy to find time for . Here are 3 options that work + on my side, all times in : + - + - + - + + Let me know which works, or send a couple of windows that suit + you and I'll come back with another set. + + Show the draft to me first. Wait for my reply ("send", "send with + edits: ...", or "skip"). Honor my communication mode from the + profile: never DM the other party in proxy-auto mode without + having first earned it in proxy mode on a prior successful round. + + 5.
RELAY AND NEGOTIATE - Send the approved draft via Telegram. When + the other party replies: + - If they pick one of my slots: jump to step 6. + - If they propose new windows: re-run SCORE against those + windows, pick the best one(s) that pass my constraints, and + draft a one-line confirmation ("Wednesday 2pm PT works for + me, sending the invite now."). Show me first under the same + approval rule. + - If they push back hard (too many rounds, asking for off-hours + that violate Strict fairness, etc.): escalate to me with a + one-line summary and recommended next move. + + 6. BOOK AND LOG - Once both sides confirm, write the confirmed meeting + to /sandbox/calendar/bookings/-.md with this + exact structure: + +# # with + - When: + - With: , + - Where: