# cuTile Kernels > Run cuTile kernel benchmarks, FMHA implementation, and LLM inference on DGX Spark and B300 ## Table of Contents - [Overview](#overview) - [Kernel Benchmarks](#kernel-benchmarks) - [End-to-End Inference](#end-to-end-inference) - [FMHA Implementation](#fmha-implementation) - [Attention Basics](#attention-basics) - [Flash Attention Algorithm](#flash-attention-algorithm) - [cuTile Pseudocode → Actual Mapping](#cutile-pseudocode-actual-mapping) - [Kernel Pseudocode](#kernel-pseudocode) - [cuTile Implementation](#cutile-implementation) - [Launching the Kernel](#launching-the-kernel) - [Optimizations](#optimizations) - [Platform Configuration](#platform-configuration) - [Performance Results](#performance-results) - [Common Issues](#common-issues) - [Companion Scripts](#companion-scripts) - [References](#references) - [Platform Comparison](#platform-comparison) - [End-to-End Throughput](#end-to-end-throughput) - [CUDA Kernel Time](#cuda-kernel-time) - [cuTile Kernel Breakdown](#cutile-kernel-breakdown) - [Troubleshooting](#troubleshooting) --- ## Overview ## Basic idea [TileGym](https://github.com/NVIDIA/TileGym) is NVIDIA's benchmark suite and integration framework for cuTile kernels - high-performance GPU kernels written using the cuTile Python DSL. cuTile compiles to Tile IR, enabling developers to write efficient kernels without low-level CUDA programming. This playbook covers three workflows: 1. **[Kernel Benchmarks](kernel-benchmarks)** - Run standalone cuTile kernel benchmarks (FMHA, MatMul, RMSNorm, etc.) 2. **[End-to-End Inference](e2e-inference)** - Run LLM inference with cuTile-optimized kernels via monkey-patching 3. **[FMHA Implementation](fmha)** - Step-by-step tutorial building a Flash Multi-Head Attention kernel from pseudocode to optimized cuTile, with companion scripts to run and benchmark The same cuTile code runs on both DGX Spark (sm_121) and B300 (sm_103) - cuTile JIT compiles to the appropriate GPU architecture automatically. ## What you'll accomplish - Run the TileGym benchmark suite on DGX Spark - Run Qwen2-7B or DeepSeek-V2-Lite inference with cuTile-optimized kernels - Observe performance scaling between DGX Spark and B300 - Build an FMHA kernel step-by-step from pseudocode to optimized cuTile implementation ## What to know before starting - Basic familiarity with Docker and command-line tools - Understanding of GPU compute concepts (TFLOPS, memory bandwidth) - No CUDA programming experience required - HuggingFace account with access token (for LLM inference) ## Prerequisites **Hardware Requirements:** - DGX Spark with Ubuntu 24.04 or B300 cloud instance - Minimum 16GB GPU memory for LLM inference - At least 50GB available storage space for model downloads **Software Requirements:** - Docker installed and configured: `docker ps` - CUDA Toolkit 13.x with Tile IR support - HuggingFace token for model access (LLM inference only) - Network access for pulling containers and downloading models Verify Docker is available: ```bash docker ps ``` If you get a permission error: ```bash sudo usermod -aG docker $USER newgrp docker ``` ## Kernel support matrix | Kernel | Category | Data Types | Description | |--------|----------|------------|-------------| | **FMHA** | Attention | float16, float8 | Flash Multi-Head Attention | | **MLA** | Attention | bfloat16, float8 | Multi-head Latent Attention | | **MLA Decoding** | Attention | float16, float8 | MLA for decode phase | | **MatMul** | Matrix Ops | float16, float8 | Matrix multiplication | | **BMM** | Matrix Ops | float16 | Batched matrix multiplication | | **Group GEMM** | Matrix Ops | float16, float8 | Grouped GEMM for MoE | | **RMSNorm** | Normalization | float16, bfloat16 | Root mean square normalization | | **RoPE** | Positional | float16 | Rotary position embedding | | **SiLU** | Activation | float16, float32 | SiLU activation with multiply | | **SwiGLU** | Activation | float16, float32 | SwiGLU fused operation | | **Softmax** | Activation | float16 | Softmax normalization | | **Dropout** | Regularization | float16, float32 | Dropout forward | ## Model support for LLM inference | Model | Supported Kernels | Batch Size | Output Tokens | Notes | |-------|-------------------|------------|---------------|-------| | **Qwen2-7B** | RoPE, RMSNorm, SwiGLU, FMHA | 16 | 50 | Standard transformer | | **DeepSeek-V2-Lite** | RoPE, RMSNorm, SiLU, MLA, MoE | 1 | 100 | MLA attention, MoE layers | ## Ancillary files All required assets can be found in the [TileGym repository](https://github.com/NVIDIA/TileGym). - `tests/benchmark/run_all.sh` - Run all kernel benchmarks - `modeling/transformers/bench_qwen.sh` - Qwen2-7B benchmark script - `modeling/transformers/bench_deepseek.sh` - DeepSeek-V2-Lite benchmark script - `modeling/transformers/infer.py` - Main inference script with TileGym integration - [`assets/fmha_optimization_tutorial.py`](assets/fmha_optimization_tutorial.py) - FMHA step-by-step optimization tutorial - [`assets/fmha_scaling_analysis.py`](assets/fmha_scaling_analysis.py) - FMHA scaling analysis across sequence lengths ## Time & risk * **Estimated time:** 30-45 minutes (including model download for LLM inference) * **Risk level:** Low * Large downloads may fail due to network issues * First run includes JIT compilation overhead * **Rollback:** Remove Docker container to undo all changes * **Last Updated:** February 2026 * First Publication ## Kernel Benchmarks ## Step 1. Pull CUDA NGC container with CTK 13.x ```bash docker pull nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04 ``` Launch an interactive session with GPU access: ```bash docker run --gpus all -it --rm \ -v ~/TileGym:/workspace/TileGym \ nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04 \ /bin/bash ``` > [!NOTE] > The `-v` flag mounts a local directory to persist the TileGym repository. The `--rm` flag automatically removes the container when you exit; omit it if you want to keep the container for later use. Or if running outside a container, install Tile IR directly: ```bash ## Requires root privileges - run with sudo or as root sudo apt-get install cuda-tile-ir-13-1 cuda-compiler-13-1 ``` ## Step 2. Clone TileGym repository ```bash git clone https://github.com/NVIDIA/TileGym cd TileGym pip install . ``` ## Step 3. Run benchmark suite ```bash cd tests/benchmark/ bash run_all.sh ``` > [!NOTE] > The benchmark runs sequentially to ensure accurate timing results. This may take 10-15 minutes to complete all kernels. ## Step 4. View results Results show cuTile performance for each kernel and sequence length. Expected output should look like: ```text ========================================== Running bench_fused_attention.py... ========================================== fused-attention-batch4-head32-d128-fwd-causal=True-float16-TFLOPS: N_CTX CuTile 0 1024.0 58.188262 1 2048.0 80.906892 2 4096.0 86.189532 3 8192.0 88.891086 4 16384.0 89.491869 ✓ PASSED: bench_fused_attention.py ``` ## Step 5. Run individual benchmarks To run specific kernel benchmarks: ```bash ## Flash Multi-Head Attention python bench_fused_attention.py ## Matrix Multiplication python bench_matrix_multiplication.py ## RMSNorm python bench_rmsnorm.py ## RoPE python bench_rope.py ## SwiGLU python bench_swiglu.py ``` ## Step 6. Clean up Exit the container: ```bash exit ``` Remove this workflow's containers (if you ran without `--rm`): ```bash ## Preferred: remove only containers from this workflow's image docker rm $(docker ps -a --filter ancestor=nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04 --format '{{.ID}}') ## Alternative: prune all stopped containers (will prompt for confirmation) ## docker container prune ``` Remove the image (optional): ```bash docker rmi nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04 ``` ## Step 7. Repeat on B300 Repeat Steps 1-6 on B300 hardware to observe scaling. See the **Platform Comparison** tab for expected scaling results. ## End-to-End Inference ## Step 1. Set up environment If you haven't already, pull the CUDA container and clone TileGym (see **Kernel Benchmarks** tab for details). First, clone TileGym on the host: ```bash mkdir -p ~/TileGym git clone https://github.com/NVIDIA/TileGym ~/TileGym ``` Then launch the container with the repository mounted: ```bash docker run --gpus all -it --rm \ -v ~/TileGym:/workspace/TileGym \ -v ~/.cache/huggingface:/root/.cache/huggingface \ nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04 \ /bin/bash ``` > [!NOTE] > The `-v ~/.cache/huggingface:/root/.cache/huggingface` mounts your HuggingFace cache to avoid re-downloading models. Install TileGym inside the container: ```bash cd /workspace/TileGym pip install . ``` Set your HuggingFace token for accessing gated models: ```bash export HF_TOKEN= ``` > [!WARNING] > You need a HuggingFace account and access token. Get one at https://huggingface.co/settings/tokens ## Step 2. Run inference benchmark Navigate to the transformers benchmark directory: ```bash cd modeling/transformers ``` **Option A: Run Qwen2-7B benchmark** ```bash ./bench_qwen.sh ``` Configuration: Model `Qwen/Qwen2-7B`, Batch size 16, Output length 50 tokens. **Option B: Run DeepSeek-V2-Lite benchmark** ```bash ./bench_deepseek.sh ``` Configuration: Model `deepseek-ai/DeepSeek-V2-Lite-Chat`, Batch size 1, Output length 100 tokens. Both scripts run two configurations: 1. **PyTorch baseline** - Standard HuggingFace inference 2. **TileGym cuTile** - With cuTile kernel replacements ## Step 3. View results **Sample DGX Spark (GB10) Results for Qwen2-7B:** ```text ======================================== Benchmark Results ======================================== Qwen2-7B_naive_bfloat16 | 15.66 tokens/s | 51.10s | 51151.0ms CUDA Qwen2-7B_cutile_attn | 18.52 tokens/s | 43.20s | 43079.7ms CUDA ======================================== ``` **cuTile Kernel Breakdown (DGX Spark - Qwen2):** | Kernel | CUDA Time (ms) | Calls | |--------|----------------|-------| | `fmha_kernel` | 4185.9 | 28 | | `swiglu_forward_kernel` | 2459.8 | 1400 | | `attention_decode_kernel_grouped` | 2271.8 | 1372 | | `rms_norm_kernel_static_persistent` | 634.7 | 57 | | `rope_kernel` | 355.6 | 1400 | ## Step 4. How TileGym monkey-patching works TileGym replaces PyTorch model operations with cuTile kernels. The snippet below is taken from TileGym's [`src/tilegym/transformers/monkey_patch.py`](https://github.com/NVIDIA/TileGym/blob/main/src/tilegym/transformers/monkey_patch.py) and invoked from [`modeling/transformers/infer.py`](https://github.com/NVIDIA/TileGym/blob/main/modeling/transformers/infer.py): ```python from tilegym.transformers import apply_tilegym_kernel_to_qwen2 apply_tilegym_kernel_to_qwen2( rope=True, # Replace RoPE with cuTile kernel rms_norm=True, # Replace RMSNorm with cuTile kernel swiglu=True, # Replace SwiGLU with cuTile kernel attn=True, # Replace attention with cuTile FMHA use_cutile=True # Use cuTile backend (vs Triton) ) ``` **Patched Kernels for Qwen2:** | Kernel | PyTorch Operation | cuTile Replacement | |--------|-------------------|-------------------| | `rms_norm_kernel_static_persistent` | `nn.RMSNorm` | Persistent RMSNorm | | `rope_kernel` | Rotary position embedding | Fused RoPE | | `fmha_kernel` | `F.scaled_dot_product_attention` | Flash Attention | | `swiglu_forward_kernel` | SiLU + Mul | Fused SwiGLU | | `attention_decode_kernel_grouped` | Decode attention | Grouped decode | **Patched Kernels for DeepSeek-V2:** (see [`src/tilegym/transformers/monkey_patch.py`](https://github.com/NVIDIA/TileGym/blob/main/src/tilegym/transformers/monkey_patch.py)) ```python from tilegym.transformers import apply_tilegym_kernel_to_deepseek_v2 apply_tilegym_kernel_to_deepseek_v2( rope=True, # Replace RoPE with cuTile kernel rms_norm=True, # Replace RMSNorm with cuTile kernel swiglu=True, # Replace SiLU+Mul with cuTile kernel attn=True, # Replace MLA attention with cuTile moe=True, # Replace MoE routing with cuTile use_cutile=True ) ``` | Kernel | PyTorch Operation | cuTile Replacement | |--------|-------------------|-------------------| | `prefill_mla` | MLA prefill attention | Multi-head Latent Attention | | `_mla_decoding_split_kv` | MLA decode attention | Split-KV decoding | | `fused_moe_kernel` | MoE expert routing | Fused MoE | | `group_gemm_kernel` | Expert FFN | Grouped GEMM | ## Step 5. Platform-specific tuning (Advanced) cuTile exposes two complementary performance-tuning mechanisms: - **[`ct.ByTarget`](https://docs.nvidia.com/cuda/cutile-python/performance.html)** - Select different kernel launch parameters per GPU architecture (`sm_`). The compiler picks the value matching the current target at JIT time; if no entry matches, the `default` value is used. See the [Performance Tuning](https://docs.nvidia.com/cuda/cutile-python/performance.html) and [Execution Model](https://docs.nvidia.com/cuda/cutile-python/execution.html) pages. - **`num_ctas`** - Number of Cooperative Thread Arrays (thread blocks) launched per kernel invocation. Tune to the number of SMs on the target GPU. - **`occupancy`** - Hint for the number of concurrent CTAs the compiler should target per SM. Higher occupancy hides memory latency but increases register/shared-memory pressure. See the [Execution Model](https://docs.nvidia.com/cuda/cutile-python/execution.html) documentation. - **[`ct.autotune`](https://docs.nvidia.com/cuda/cutile-python/performance.html)** - Search a list of candidate values at runtime and pick the fastest configuration. Results are reported via [`cuda.tile.tune.TuningResult`](https://docs.nvidia.com/cuda/cutile-python/generated/cuda.tile.tune.TuningResult.html) / [`Measurement`](https://docs.nvidia.com/cuda/cutile-python/generated/cuda.tile.tune.Measurement.html). ```python import cuda.tile as ct @ct.kernel( # # num_ctas: how many thread blocks to launch. # # Use ByTarget to pick an arch-specific value at JIT time. num_ctas=ct.ByTarget({ "sm_103": 8, # B300 - more SMs, launch more CTAs "sm_121": 4, # DGX Spark - fewer SMs (48), use fewer CTAs "default": 1, # Fallback for any other GPU architecture }), # # occupancy: hint for concurrent CTAs per SM (latency hiding vs. register pressure). occupancy=ct.ByTarget({ "sm_103": 16, # B300 - high occupancy, plenty of registers/SMEM "sm_121": 12, # DGX Spark - moderate occupancy "default": 8, # Conservative fallback }), opt_level=3 # Maximum compiler optimization level ) def optimized_kernel(A, B, C): # # Same kernel code works on all platforms; # # ByTarget swaps in the arch-specific launch params automatically. ... ``` For automatic tuning, use [`ct.autotune`](https://docs.nvidia.com/cuda/cutile-python/performance.html) to search over candidate values and pick the fastest configuration at runtime: ```python @ct.kernel( # # autotune: benchmark each value and pick the fastest. num_ctas=ct.autotune([1, 2, 4, 8, 16]), occupancy=ct.autotune([8, 12, 16, 24]), opt_level=3 ) def autotuned_kernel(A, B, C): ... ``` ## Step 6. Repeat on B300 Repeat Steps 1-3 on B300 hardware. The **same code runs without modification** - cuTile JIT compiles for sm_103 automatically. See the **Platform Comparison** tab for detailed scaling results. ## FMHA Implementation ## FMHA Implementation Guide > [!NOTE] > This is a guide to understanding FMHA implementation in cuTile, not a complete reference. For comprehensive documentation, see the [cuTile Python Documentation](https://docs.nvidia.com/cuda/cutile-python/). ### Attention Basics Attention allows a neural network to focus on relevant parts of the input. In transformers (GPT, LLaMA, Qwen), each position computes how much to attend to every other position using three vectors: - **Query (Q)**: "What am I looking for?" - **Key (K)**: "What do I contain?" - **Value (V)**: "Here is my content" ```text Attention(Q, K, V) = softmax(Q × K^T / √d) × V Shapes: Q, K, V = [batch, heads, seq_len, head_dim] Q × K^T = [batch, heads, seq_len, seq_len] # Attention scores Output = [batch, heads, seq_len, head_dim] ``` For autoregressive models, **causal masking** ensures each token only attends to previous tokens by setting future scores to -infinity before softmax. ### Flash Attention Algorithm Standard attention materializes a [seq_len × seq_len] matrix (e.g., 2 GB for seq_len=32768). Flash Attention avoids this by processing in tiles with **online softmax**: ```text m = -infinity # Running maximum l = 0 # Running sum of exp(x - m) acc = 0 # Running weighted sum of values FOR each K,V tile: scores = Q_tile @ K_tile.T * scale m_new = max(m, max(scores)) correction = exp(m - m_new) l = l * correction + sum(exp(scores - m_new)) acc = acc * correction + exp(scores - m_new) @ V_tile m = m_new output = acc / l ``` ### cuTile Pseudocode → Actual Mapping | Concept | Pseudocode | cuTile | |---|---|---| | Define kernel | `KERNEL fmha(...)` | `@ct.kernel()` | | Get block ID | `block_x = BLOCK_ID_X` | `bid_x = ct.bid(0)` | | Create indices | `range(0, N)` | `ct.arange(N, dtype=ct.int32)` | | Create constant tile | `tile = zeros(M, N)` | `ct.full((M, N), 0.0, dtype)` | | Load from memory | `tile = LOAD(ptr, shape)` | `ct.load(tensor, index, shape)` | | Store to memory | `STORE(ptr, tile)` | `ct.store(tensor, index, tile)` | | Matrix multiply | `C = A @ B + C` | `ct.mma(A, B, C)` | | Reduction | `max_val = MAX(tile, axis)` | `ct.max(tile, axis, keepdims)` | ### Kernel Pseudocode ```text KERNEL fmha(Q, K, V, Out, scale, TILE_M, TILE_N): tile_row = BLOCK_ID_X batch_head = BLOCK_ID_Y batch = batch_head // num_heads head = batch_head % num_heads m_i = full(TILE_M, -infinity) l_i = full(TILE_M, 0) acc = zeros(TILE_M, head_dim) q = LOAD(Q[batch, head, tile_row*TILE_M : (tile_row+1)*TILE_M, :]) FOR j = 0 to num_k_tiles: k = LOAD(K[batch, head, j*TILE_N : (j+1)*TILE_N, :]) v = LOAD(V[batch, head, j*TILE_N : (j+1)*TILE_N, :]) scores = MMA(q, transpose(k)) * scale IF causal AND in_mask_region: scores = WHERE(valid_mask, scores, -infinity) m_new = max(m_i, row_max(scores)) correction = exp(m_i - m_new) p = exp(scores - m_new) l_i = l_i * correction + row_sum(p) acc = acc * correction + MMA(p, v) m_i = m_new out = acc / l_i STORE(Out[batch, head, tile_row*TILE_M :, :], out) ``` ### cuTile Implementation ```python import cuda.tile as ct import math ConstInt = ct.Constant[int] ConstBool = ct.Constant[bool] @ct.kernel() def fmha_kernel(Q, K, V, Out, qk_scale: float, TILE_D: ConstInt, H: ConstInt, TILE_M: ConstInt, TILE_N: ConstInt, CAUSAL: ConstBool): bid_x, bid_y = ct.bid(0), ct.bid(1) batch_idx, head_idx = bid_y // H, bid_y % H offs_m = (bid_x * TILE_M + ct.arange(TILE_M, dtype=ct.int32))[:, None] offs_n_tile = ct.arange(TILE_N, dtype=ct.int32)[None, :] m_i = ct.full((TILE_M, 1), -math.inf, dtype=ct.float32) l_i = ct.full((TILE_M, 1), 0.0, dtype=ct.float32) acc = ct.full((TILE_M, TILE_D), 0.0, dtype=ct.float32) q = ct.load(Q, index=(batch_idx, head_idx, bid_x, 0), shape=(1, 1, TILE_M, TILE_D)).reshape((TILE_M, TILE_D)) k_seqlen = K.shape[2] if CAUSAL: Tc = ct.cdiv(min((bid_x + 1) * TILE_M, k_seqlen), TILE_N) mask_start = (bid_x * TILE_M) // TILE_N else: Tc = ct.cdiv(k_seqlen, TILE_N) mask_start = k_seqlen // TILE_N for j in range(0, Tc): k_tile = ct.load(K, index=(batch_idx, head_idx, j, 0), shape=(1, 1, TILE_N, TILE_D)).reshape((TILE_N, TILE_D)) k_t = ct.permute(k_tile, (1, 0)) qk = ct.mma(q, k_t, ct.full((TILE_M, TILE_N), 0.0, dtype=ct.float32)) qk = qk * qk_scale if CAUSAL and j >= mask_start: offs_n = j * TILE_N + offs_n_tile qk = ct.where(offs_m >= offs_n, qk, ct.full((TILE_M, TILE_N), -math.inf, dtype=ct.float32)) m_ij = ct.maximum(m_i, ct.max(qk, axis=-1, keepdims=True)) qk = qk - m_ij p = ct.exp(qk) alpha = ct.exp(m_i - m_ij) l_i = l_i * alpha + ct.sum(p, axis=-1, keepdims=True) acc = acc * alpha v_tile = ct.load(V, index=(batch_idx, head_idx, j, 0), shape=(1, 1, TILE_N, TILE_D)).reshape((TILE_N, TILE_D)) acc = ct.mma(p.astype(Q.dtype), v_tile, acc) m_i = m_ij acc = (acc / l_i).reshape((1, 1, TILE_M, TILE_D)).astype(Out.dtype) ct.store(Out, index=(batch_idx, head_idx, bid_x, 0), tile=acc) ``` ### Launching the Kernel ```python def run_fmha(q, k, v, sm_scale, is_causal=True): import torch TILE_M, TILE_N = 64, 64 # Platform-specific (see below) batch, num_heads, seq_len, head_dim = q.shape out = torch.empty_like(q) grid = (math.ceil(seq_len / TILE_M), batch * num_heads, 1) ct.launch( torch.cuda.current_stream(), grid, fmha_kernel, (q, k, v, out, sm_scale, head_dim, num_heads, TILE_M, TILE_N, is_causal) ) return out ``` ### Optimizations #### exp2 + flush_to_zero `exp2(x) = 2^x` is faster than `exp(x)` on GPU. Requires scale adjustment by `1/log(2)`. ```python ## Convert natural-exp scale to base-2 so we can use the faster ct.exp2 intrinsic. ## exp(x) == exp2(x / log(2)) == exp2(x * INV_LOG_2). INV_LOG_2 = 1.0 / math.log(2) # ≈ 1.4427 qk_scale_log2 = qk_scale * INV_LOG_2 # Pre-multiply the softmax scale once ## ... in loop: ## Fuse the running-max update with the scale multiplication. m_ij = ct.max(qk, axis=-1, keepdims=True) * qk_scale_log2 ## Subtract the running max for numerical stability (online softmax). qk = qk * qk_scale_log2 - m_ij ## flush_to_zero=True: flush denormals to 0 -> avoids slow denormal handling on GPU. p = ct.exp2(qk, flush_to_zero=True) alpha = ct.exp2(m_i - m_ij, flush_to_zero=True) # Correction factor for previous acc/l_i ``` #### Load Order Transpose Load K already transposed using `order` parameter, avoiding explicit permute. ```python ## order=(0,1,3,2) swaps the last two axes during the load, ## producing K^T directly in registers -- no extra ct.permute() needed. ## shape is expressed in the transposed layout: (1, 1, TILE_D, TILE_N). k_t = ct.load(K, index=(..., 0, j), shape=(1,1,TILE_D,TILE_N), order=(0,1,3,2)).reshape((TILE_D, TILE_N)) ``` #### Latency Hints Prefetch data to overlap memory loads with computation. See the [Performance Tuning docs](https://docs.nvidia.com/cuda/cutile-python/performance.html) for the full list of load/store hints (e.g. `allow_tma`, `latency`). ```python ## latency=N tells the compiler to issue this load N loop iterations in ## advance of its use, so the memory transfer overlaps with the MMA work ## from earlier iterations. Larger latency = deeper software pipeline but ## more register pressure. k_t = ct.load(K, ..., latency=2) # Prefetch K 2 iterations ahead v_tile = ct.load(V, ..., latency=4) # Prefetch V 4 iterations ahead (used later in the loop) ``` #### Occupancy Allow multiple thread blocks per SM to hide memory latency. See the [Execution Model docs](https://docs.nvidia.com/cuda/cutile-python/execution.html) for details on how `occupancy` interacts with registers and shared memory. ```python ## occupancy=N is a hint to the compiler to target N concurrent CTAs per SM. ## Higher occupancy -> more warps available to hide memory latency, ## but constrains the per-CTA register/SMEM budget. @ct.kernel(occupancy=2) # 2 thread blocks (CTAs) co-resident per SM def fmha_optimized(...): ``` #### Approximate Division Use fast approximate division for final normalization. ```python from cuda.tile import RoundingMode as RMd ## RMd.APPROX -> hardware approximate reciprocal/divide (MUFU), much faster ## than IEEE-compliant division. Safe here because it's the final softmax ## normalization step where a small ULP error is acceptable. ## flush_to_zero=True flushes denormals to 0 to avoid the slow path. acc = ct.truediv(acc, l_i, flush_to_zero=True, rounding_mode=RMd.APPROX) ``` ### Platform Configuration The same kernel code works on all platforms; only configuration parameters change. Use [`ct.ByTarget`](https://docs.nvidia.com/cuda/cutile-python/performance.html) to select values per architecture, or [`ct.autotune`](https://docs.nvidia.com/cuda/cutile-python/performance.html) to search candidate values automatically. | Platform | TILE_M | TILE_N | Occupancy | Rationale | |---|---|---|---|---| | DGX Spark (sm_121) | 64 | 64 | 2 | Smaller tiles, higher occupancy for 48 SMs | | B300 (sm_103) | 256 | 128 | 1 | Large tiles maximize HBM3e throughput | | B300 alternate | 128 | 128 | 2 | Higher occupancy, balanced parallelism | ```python import cuda.tile as ct @ct.kernel( # # TILE_M / TILE_N: rows/cols of the Q and K/V tiles processed per CTA. # # Larger tiles -> more arithmetic intensity; smaller tiles -> higher occupancy. # # occupancy: target concurrent CTAs per SM (latency hiding vs. register pressure). occupancy=ct.ByTarget({ "sm_121": 2, # DGX Spark (48 SMs): 2 CTAs/SM for latency hiding "sm_100": 1, # B300: larger tiles already saturate the SM "default": 1, # Conservative fallback for other architectures }), opt_level=3 # Maximum compiler optimization level ) def fmha_kernel(...): ... ``` ### Performance Results > **Note:** PyTorch SDPA is used for correctness verification only, not performance comparison. #### DGX Spark (sm_121) — Seq 2048 | Step | Optimization | Latency (ms) | TFLOPS | |---|---|---|---| | 1 | Basic cuTile | 2.19 | 62.8 | | 2 | + exp2 | 2.07 | 66.5 | | 3 | + Load Order | 2.07 | 66.3 | | 4 | + Latency Hints | 2.07 | 66.5 | | 5 | + Occupancy=2 | 1.73 | 79.5 | | 6 | + Approx Div (Final) | 1.69 | 81.1 | #### B300 (sm_103) — Various Seq Lengths | Seq Len | Latency (ms) | TFLOPS | vs Spark | |---|---|---|---| | 1024 | 0.074 | 465 | 5.7x | | 2048 | 0.178 | 770 | 9.5x | | 4096 | 0.550 | 999 | 15.1x | | 8192 | 1.897 | 1159 | 14.6x | | 16384 | 7.014 | 1254 | 14.2x | ### Common Issues | Issue | Solution | |---|---| | Shape mismatch in ct.mma | Ensure A is (M,K), B is (K,N), C is (M,N) | | dtype errors | Use `.astype()` before mma; accumulator should be float32 | | Incorrect results with causal | Check mask_start calculation and `offs_m >= offs_n` logic | | Low performance | Try different TILE_M/N, check occupancy, verify latency hints | ### Companion Scripts The following scripts are included in this playbook and can be run on DGX Spark or B300: - **[`assets/fmha_optimization_tutorial.py`](assets/fmha_optimization_tutorial.py)** — Step-by-step optimization tutorial. Builds the FMHA kernel from basic to fully optimized, matching the progression in this guide. - **[`assets/fmha_scaling_analysis.py`](assets/fmha_scaling_analysis.py)** — Scaling analysis across sequence lengths. Benchmarks each optimization level and generates performance data. ```bash ## Run the optimization tutorial (DGX Spark) python assets/fmha_optimization_tutorial.py --correctness-check ## Run the scaling analysis python assets/fmha_scaling_analysis.py --iterations 100 ``` ### References - [cuTile Python Documentation](https://docs.nvidia.com/cuda/cutile-python/) - [Tile IR Specification](https://docs.nvidia.com/cuda/tile-ir/) - [TileGym (pre-optimized kernels)](https://github.com/NVIDIA/TileGym) - [NVIDIA Blog: Tuning Flash Attention for Peak Performance in CUDA Tile](https://developer.nvidia.com/blog/tuning-flash-attention-for-peak-performance-in-nvidia-cuda-tile/) - [Flash Attention Paper](https://arxiv.org/abs/2205.14135) ## Platform Comparison ## DGX Spark vs B300 Performance Comparison This page summarizes performance scaling between DGX Spark (GB10) and B300 for both kernel benchmarks and end-to-end LLM inference. ## Kernel Benchmark Scaling Use the ratios below as a reference for how kernel performance scales from DGX Spark (GB10) to B300. | Kernel | Metric | B300 / GB10 | |--------|--------|-------------| | FMHA (causal, 8192) | TFLOPS | 13.7x | | FMHA (non-causal, 8192) | TFLOPS | 15.1x | | MatMul (8192) | TFLOPS | 18.9x | | BMM (batch8, 4096) | TFLOPS | 19.4x | | Group GEMM (4096) | TFLOPS | 23.9x | | RMSNorm (4096) | GB/s | 33.1x | | RoPE (16384) | GB/s | 22.8x | **Key Observations:** - Compute-heavy kernels typically scale 14-24x from GB10 to B300 - Memory-bound kernels can scale 20-33x due to HBM bandwidth advantage ## Qwen2-7B Performance ### End-to-End Throughput | Configuration | DGX Spark | B300 | Platform Speedup | |---------------|-----------|------|------------------| | **cuTile** | 18.52 tok/s | 257.33 tok/s | **13.9x** | ### CUDA Kernel Time | Configuration | DGX Spark | B300 | Platform Speedup | |---------------|-----------|------|------------------| | **cuTile** | 43,080 ms | 2,954 ms | **14.6x** | ### cuTile Kernel Breakdown **DGX Spark (GB10):** | Kernel | CUDA Time (ms) | Calls | |--------|----------------|-------| | `fmha_kernel` | 4,185.9 | 28 | | `swiglu_forward_kernel` | 2,459.8 | 1,400 | | `attention_decode_kernel_grouped` | 2,271.8 | 1,372 | | `rms_norm_kernel_static_persistent` | 634.7 | 57 | | `rope_kernel` | 355.6 | 1,400 | **B300:** | Kernel | CUDA Time (ms) | Speedup vs Spark | |--------|----------------|------------------| | `fmha_kernel` | 337.9 | 12.4x | | `swiglu_forward_kernel` | 226.3 | 10.9x | | `attention_decode_kernel_grouped` | 111.0 | 20.5x | | `rms_norm_kernel_static_persistent` | 29.7 | 21.4x | | `rope_kernel` | 16.7 | 21.3x | **Same code, different architectures** - cuTile JIT compiles for sm_121 (Spark) and sm_103 (B300) ## Platform Specifications | Specification | DGX Spark (GB10) | B300 | |---------------|------------------|------| | Compute Capability | sm_121 (12.1) | sm_103 (10.3) | | SMs | 48 | 132 | | Memory | 128 GB LPDDR5x | 192 GB HBM3e | | Memory Bandwidth | 273 GB/s | 8 TB/s | ## Troubleshooting | Symptom | Cause | Fix | |---------|-------|-----| | `docker: permission denied` | User not in docker group | `sudo usermod -aG docker $USER && newgrp docker` | | `401 Client Error: Unauthorized` | Missing HuggingFace token | `export HF_TOKEN=` | | `ModuleNotFoundError: tilegym` | TileGym not installed | `cd TileGym && pip install .` | | `RuntimeError: CUDA out of memory` | Model too large | Reduce batch size or use smaller model | | `Killed` during model load | Out of system memory | Clear cache: `sync; echo 3 > /proc/sys/vm/drop_caches` | | Slow first run | JIT compilation | Normal - cuTile compiles kernels on first run | | `FileNotFoundError: input_prompt_small.txt` | Missing input file | Run from `modeling/transformers` directory | | `torch.cuda.OutOfMemoryError` | Insufficient GPU memory | Reduce `--batch_size` parameter | | `ImportError: cuda.tile` | Missing Tile IR | Install: `apt-get install cuda-tile-ir-13-1` | | Benchmark hangs | GPU busy or locked | Check `nvidia-smi` for other processes | > [!NOTE] > DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. > With many applications still updating to take advantage of UMA, you may encounter memory issues even when within > the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with: ```bash sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches' ``` > [!TIP] > First run of cuTile kernels includes JIT compilation overhead. Subsequent runs will be faster as compiled kernels are cached. For the latest known issues, please review the [DGX Spark User Guide](https://docs.nvidia.com/dgx/dgx-spark/known-issues.html).