cuTile Kernels
Run cuTile kernel benchmarks, FMHA implementation, and LLM inference on DGX Spark and B300
Table of Contents
- Overview
- Kernel Benchmarks
- End-to-End Inference
- FMHA Implementation
- Platform Comparison
- Troubleshooting
Overview
Basic idea
TileGym is NVIDIA's benchmark suite and integration framework for cuTile kernels - high-performance GPU kernels written using the cuTile Python DSL. cuTile compiles to Tile IR, enabling developers to write efficient kernels without low-level CUDA programming.
This playbook covers three workflows:
- Kernel Benchmarks - Run standalone cuTile kernel benchmarks (FMHA, MatMul, RMSNorm, etc.)
- End-to-End Inference - Run LLM inference with cuTile-optimized kernels via monkey-patching
- FMHA Implementation - Step-by-step tutorial building a Flash Multi-Head Attention kernel from pseudocode to optimized cuTile, with companion scripts to run and benchmark
The same cuTile code runs on both DGX Spark (sm_121) and B300 (sm_103) - cuTile JIT compiles to the appropriate GPU architecture automatically.
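As a small illustration of the naming used above, the architecture label is the compute capability's major and minor digits joined together. The helper below is hypothetical (it is not part of cuTile); on a real system the (major, minor) pair could be read with torch.cuda.get_device_capability().

```python
def sm_name(major: int, minor: int) -> str:
    """Format a CUDA compute capability as an sm_<major><minor> label."""
    return f"sm_{major}{minor}"


# Compute capabilities quoted in this playbook.
print(sm_name(12, 1))  # DGX Spark
print(sm_name(10, 3))  # B300
```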
What you'll accomplish
- Run the TileGym benchmark suite on DGX Spark
- Run Qwen2-7B or DeepSeek-V2-Lite inference with cuTile-optimized kernels
- Observe performance scaling between DGX Spark and B300
- Build an FMHA kernel step-by-step from pseudocode to optimized cuTile implementation
What to know before starting
- Basic familiarity with Docker and command-line tools
- Understanding of GPU compute concepts (TFLOPS, memory bandwidth)
- No CUDA programming experience required
- HuggingFace account with access token (for LLM inference)
Prerequisites
Hardware Requirements:
- DGX Spark with Ubuntu 24.04 or B300 cloud instance
- Minimum 16GB GPU memory for LLM inference
- At least 50GB available storage space for model downloads
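A rough way to see where a GPU memory floor like this comes from: weight memory is approximately parameter count times bytes per parameter. The sketch below assumes an illustrative round figure of 7 billion parameters in 16-bit precision; activations and the KV cache come on top of this.

```python
def weight_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights in GiB (weights only)."""
    return num_params * bytes_per_param / 2**30


# Assumed round figure: 7e9 parameters at 2 bytes each (float16/bfloat16).
print(f"{weight_gib(7e9):.1f} GiB")
```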
Software Requirements:
- Docker installed and configured (check with docker ps)
- CUDA Toolkit 13.x with Tile IR support
- HuggingFace token for model access (LLM inference only)
- Network access for pulling containers and downloading models
Verify Docker is available:
docker ps
If you get a permission error:
sudo usermod -aG docker $USER
newgrp docker
Kernel support matrix
| Kernel | Category | Data Types | Description |
|---|---|---|---|
| FMHA | Attention | float16, float8 | Flash Multi-Head Attention |
| MLA | Attention | bfloat16, float8 | Multi-head Latent Attention |
| MLA Decoding | Attention | float16, float8 | MLA for decode phase |
| MatMul | Matrix Ops | float16, float8 | Matrix multiplication |
| BMM | Matrix Ops | float16 | Batched matrix multiplication |
| Group GEMM | Matrix Ops | float16, float8 | Grouped GEMM for MoE |
| RMSNorm | Normalization | float16, bfloat16 | Root mean square normalization |
| RoPE | Positional | float16 | Rotary position embedding |
| SiLU | Activation | float16, float32 | SiLU activation with multiply |
| SwiGLU | Activation | float16, float32 | SwiGLU fused operation |
| Softmax | Activation | float16 | Softmax normalization |
| Dropout | Regularization | float16, float32 | Dropout forward |
Model support for LLM inference
| Model | Supported Kernels | Batch Size | Output Tokens | Notes |
|---|---|---|---|---|
| Qwen2-7B | RoPE, RMSNorm, SwiGLU, FMHA | 16 | 50 | Standard transformer |
| DeepSeek-V2-Lite | RoPE, RMSNorm, SiLU, MLA, MoE | 1 | 100 | MLA attention, MoE layers |
Ancillary files
All required assets can be found in the TileGym repository.
- tests/benchmark/run_all.sh - Run all kernel benchmarks
- modeling/transformers/bench_qwen.sh - Qwen2-7B benchmark script
- modeling/transformers/bench_deepseek.sh - DeepSeek-V2-Lite benchmark script
- modeling/transformers/infer.py - Main inference script with TileGym integration
- assets/fmha_optimization_tutorial.py - FMHA step-by-step optimization tutorial
- assets/fmha_scaling_analysis.py - FMHA scaling analysis across sequence lengths
Time & risk
- Estimated time: 30-45 minutes (including model download for LLM inference)
- Risk level: Low
- Large downloads may fail due to network issues
- First run includes JIT compilation overhead
- Rollback: Remove Docker container to undo all changes
- Last Updated: February 2026
- First Publication
Kernel Benchmarks
Step 1. Pull CUDA NGC container with CTK 13.x
docker pull nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04
Launch an interactive session with GPU access:
docker run --gpus all -it --rm \
-v ~/TileGym:/workspace/TileGym \
nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04 \
/bin/bash
Note
The -v flag mounts a local directory to persist the TileGym repository. The --rm flag automatically removes the container when you exit; omit it if you want to keep the container for later use.
Or if running outside a container, install Tile IR directly:
## Requires root privileges - run with sudo or as root
sudo apt-get install cuda-tile-ir-13-1 cuda-compiler-13-1
Step 2. Clone TileGym repository
git clone https://github.com/NVIDIA/TileGym
cd TileGym
pip install .
Step 3. Run benchmark suite
cd tests/benchmark/
bash run_all.sh
Note
The benchmark runs sequentially to ensure accurate timing results. This may take 10-15 minutes to complete all kernels.
Step 4. View results
Results show cuTile performance for each kernel and sequence length.
Expected output should look like:
==========================================
Running bench_fused_attention.py...
==========================================
fused-attention-batch4-head32-d128-fwd-causal=True-float16-TFLOPS:
N_CTX CuTile
0 1024.0 58.188262
1 2048.0 80.906892
2 4096.0 86.189532
3 8192.0 88.891086
4 16384.0 89.491869
✓ PASSED: bench_fused_attention.py
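If you want to post-process these numbers, the table rows can be parsed with a few lines of Python. This is a minimal sketch that assumes the output keeps the three-column "index, N_CTX, value" layout shown above; parse_benchmark_table is a hypothetical helper, not part of TileGym.

```python
def parse_benchmark_table(text: str) -> dict[int, float]:
    """Map N_CTX -> TFLOPS from rows shaped like '<idx> <N_CTX> <value>'."""
    results = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue
        try:
            int(parts[0])  # row index; raises on non-data lines
            n_ctx, tflops = float(parts[1]), float(parts[2])
        except ValueError:
            continue  # header, banner, or status line
        results[int(n_ctx)] = tflops
    return results


sample = """\
N_CTX CuTile
0 1024.0 58.188262
1 2048.0 80.906892
2 4096.0 86.189532
✓ PASSED: bench_fused_attention.py
"""
table = parse_benchmark_table(sample)
print(table)
```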
Step 5. Run individual benchmarks
To run specific kernel benchmarks:
## Flash Multi-Head Attention
python bench_fused_attention.py
## Matrix Multiplication
python bench_matrix_multiplication.py
## RMSNorm
python bench_rmsnorm.py
## RoPE
python bench_rope.py
## SwiGLU
python bench_swiglu.py
Step 6. Clean up
Exit the container:
exit
Remove this workflow's containers (if you ran without --rm):
## Preferred: remove only containers from this workflow's image
docker rm $(docker ps -a --filter ancestor=nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04 --format '{{.ID}}')
## Alternative: prune all stopped containers (will prompt for confirmation)
## docker container prune
Remove the image (optional):
docker rmi nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04
Step 7. Repeat on B300
Repeat Steps 1-6 on B300 hardware to observe scaling. See the Platform Comparison tab for expected scaling results.
End-to-End Inference
Step 1. Set up environment
If you haven't already, pull the CUDA container and clone TileGym (see Kernel Benchmarks tab for details).
First, clone TileGym on the host:
mkdir -p ~/TileGym
git clone https://github.com/NVIDIA/TileGym ~/TileGym
Then launch the container with the repository mounted:
docker run --gpus all -it --rm \
-v ~/TileGym:/workspace/TileGym \
-v ~/.cache/huggingface:/root/.cache/huggingface \
nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04 \
/bin/bash
Note
The -v ~/.cache/huggingface:/root/.cache/huggingface option mounts your HuggingFace cache to avoid re-downloading models.
Install TileGym inside the container:
cd /workspace/TileGym
pip install .
Set your HuggingFace token for accessing gated models:
export HF_TOKEN=<your_huggingface_token>
Warning
You need a HuggingFace account and access token. Get one at https://huggingface.co/settings/tokens
Step 2. Run inference benchmark
Navigate to the transformers benchmark directory:
cd modeling/transformers
Option A: Run Qwen2-7B benchmark
./bench_qwen.sh
Configuration: Model Qwen/Qwen2-7B, Batch size 16, Output length 50 tokens.
Option B: Run DeepSeek-V2-Lite benchmark
./bench_deepseek.sh
Configuration: Model deepseek-ai/DeepSeek-V2-Lite-Chat, Batch size 1, Output length 100 tokens.
Both scripts run two configurations:
- PyTorch baseline - Standard HuggingFace inference
- TileGym cuTile - With cuTile kernel replacements
Step 3. View results
Sample DGX Spark (GB10) Results for Qwen2-7B:
========================================
Benchmark Results
========================================
Qwen2-7B_naive_bfloat16 | 15.66 tokens/s | 51.10s | 51151.0ms CUDA
Qwen2-7B_cutile_attn | 18.52 tokens/s | 43.20s | 43079.7ms CUDA
========================================
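The throughput figures above appear consistent with batch size times output tokens divided by wall time, using the Qwen2-7B configuration from Step 2. The check below is a worked calculation under that assumption, not the benchmark script's own code.

```python
BATCH_SIZE = 16      # Qwen2-7B configuration from Step 2
OUTPUT_TOKENS = 50


def tokens_per_second(wall_seconds: float) -> float:
    """Generated tokens per second for one benchmark run."""
    return BATCH_SIZE * OUTPUT_TOKENS / wall_seconds


baseline = tokens_per_second(51.10)  # PyTorch baseline wall time
cutile = tokens_per_second(43.20)    # cuTile wall time
print(f"baseline {baseline:.2f} tok/s, cuTile {cutile:.2f} tok/s, "
      f"speedup {cutile / baseline:.2f}x")
```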
cuTile Kernel Breakdown (DGX Spark - Qwen2):
| Kernel | CUDA Time (ms) | Calls |
|---|---|---|
| fmha_kernel | 4185.9 | 28 |
| swiglu_forward_kernel | 2459.8 | 1400 |
| attention_decode_kernel_grouped | 2271.8 | 1372 |
| rms_norm_kernel_static_persistent | 634.7 | 57 |
| rope_kernel | 355.6 | 1400 |
Step 4. How TileGym monkey-patching works
TileGym replaces PyTorch model operations with cuTile kernels. The snippet below is taken from TileGym's src/tilegym/transformers/monkey_patch.py and invoked from modeling/transformers/infer.py:
from tilegym.transformers import apply_tilegym_kernel_to_qwen2
apply_tilegym_kernel_to_qwen2(
rope=True, # Replace RoPE with cuTile kernel
rms_norm=True, # Replace RMSNorm with cuTile kernel
swiglu=True, # Replace SwiGLU with cuTile kernel
attn=True, # Replace attention with cuTile FMHA
use_cutile=True # Use cuTile backend (vs Triton)
)
Patched Kernels for Qwen2:
| Kernel | PyTorch Operation | cuTile Replacement |
|---|---|---|
| rms_norm_kernel_static_persistent | nn.RMSNorm | Persistent RMSNorm |
| rope_kernel | Rotary position embedding | Fused RoPE |
| fmha_kernel | F.scaled_dot_product_attention | Flash Attention |
| swiglu_forward_kernel | SiLU + Mul | Fused SwiGLU |
| attention_decode_kernel_grouped | Decode attention | Grouped decode |
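The general mechanism behind monkey-patching can be shown without TileGym or PyTorch: a method on a class is replaced at runtime, so every existing and future instance picks up the new implementation. The classes and functions below are hypothetical stand-ins written in pure Python, not TileGym's actual code.

```python
import math


class RMSNorm:
    """Stand-in for a framework layer (hypothetical, pure Python)."""

    def __init__(self, eps: float = 1e-6):
        self.eps = eps

    def forward(self, x: list[float]) -> list[float]:
        mean_sq = sum(v * v for v in x) / len(x)
        return [v / math.sqrt(mean_sq + self.eps) for v in x]


def fast_rms_norm_forward(self, x: list[float]) -> list[float]:
    """Stand-in for an optimized kernel; must match the original numerically."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + self.eps)
    return [v * inv_rms for v in x]


def apply_patch(rms_norm: bool = True) -> None:
    """Swap the class attribute so every instance uses the replacement."""
    if rms_norm:
        RMSNorm.forward = fast_rms_norm_forward


layer = RMSNorm()
before = layer.forward([1.0, 2.0, 3.0])
apply_patch(rms_norm=True)
after = layer.forward([1.0, 2.0, 3.0])  # same object, new implementation
print(RMSNorm.forward is fast_rms_norm_forward)
```

Because the swap happens on the class, the model definition itself does not need to change, which is what lets a single patch call be applied before loading a model.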
Patched Kernels for DeepSeek-V2: (see src/tilegym/transformers/monkey_patch.py)
from tilegym.transformers import apply_tilegym_kernel_to_deepseek_v2
apply_tilegym_kernel_to_deepseek_v2(
rope=True, # Replace RoPE with cuTile kernel
rms_norm=True, # Replace RMSNorm with cuTile kernel
swiglu=True, # Replace SiLU+Mul with cuTile kernel
attn=True, # Replace MLA attention with cuTile
moe=True, # Replace MoE routing with cuTile
use_cutile=True
)
| Kernel | PyTorch Operation | cuTile Replacement |
|---|---|---|
| prefill_mla | MLA prefill attention | Multi-head Latent Attention |
| _mla_decoding_split_kv | MLA decode attention | Split-KV decoding |
| fused_moe_kernel | MoE expert routing | Fused MoE |
| group_gemm_kernel | Expert FFN | Grouped GEMM |
Step 5. Platform-specific tuning (Advanced)
cuTile exposes two complementary performance-tuning mechanisms:
- ct.ByTarget - Select different kernel launch parameters per GPU architecture (sm_<major><minor>). The compiler picks the value matching the current target at JIT time; if no entry matches, the default value is used. See the Performance Tuning and Execution Model pages.
  - num_ctas - Number of Cooperative Thread Arrays (thread blocks) launched per kernel invocation. Tune to the number of SMs on the target GPU.
  - occupancy - Hint for the number of concurrent CTAs the compiler should target per SM. Higher occupancy hides memory latency but increases register/shared-memory pressure. See the Execution Model documentation.
- ct.autotune - Search a list of candidate values at runtime and pick the fastest configuration. Results are reported via cuda.tile.tune.TuningResult / Measurement.
import cuda.tile as ct

@ct.kernel(
    # num_ctas: how many thread blocks to launch.
    # Use ByTarget to pick an arch-specific value at JIT time.
    num_ctas=ct.ByTarget({
        "sm_103": 8,     # B300 - more SMs, launch more CTAs
        "sm_121": 4,     # DGX Spark - fewer SMs (48), use fewer CTAs
        "default": 1,    # Fallback for any other GPU architecture
    }),
    # occupancy: hint for concurrent CTAs per SM (latency hiding vs. register pressure).
    occupancy=ct.ByTarget({
        "sm_103": 16,    # B300 - high occupancy, plenty of registers/SMEM
        "sm_121": 12,    # DGX Spark - moderate occupancy
        "default": 8,    # Conservative fallback
    }),
    opt_level=3          # Maximum compiler optimization level
)
def optimized_kernel(A, B, C):
    # Same kernel code works on all platforms;
    # ByTarget swaps in the arch-specific launch params automatically.
    ...
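The selection rule described in this step (use the entry for the current target, otherwise the default) behaves like a dictionary lookup with a fallback. The function below is a hypothetical pure-Python model of that rule for illustration; it is not how cuTile implements ByTarget.

```python
def resolve_by_target(values: dict, target: str):
    """Return the entry for `target`, falling back to the "default" key.

    Assumes a "default" entry is always present, as in the examples here.
    """
    return values.get(target, values["default"])


num_ctas = {"sm_103": 8, "sm_121": 4, "default": 1}
print(resolve_by_target(num_ctas, "sm_121"))  # entry for DGX Spark
print(resolve_by_target(num_ctas, "sm_90"))   # no entry, so the default applies
```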
For automatic tuning, use ct.autotune to search over candidate values and pick the fastest configuration at runtime:
@ct.kernel(
    # autotune: benchmark each value and pick the fastest.
    num_ctas=ct.autotune([1, 2, 4, 8, 16]),
    occupancy=ct.autotune([8, 12, 16, 24]),
    opt_level=3
)
def autotuned_kernel(A, B, C):
    ...
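Conceptually, this kind of tuning measures each candidate and keeps the cheapest one. The sketch below models that idea in pure Python; pick_fastest and its measure hook are hypothetical names introduced for illustration and say nothing about cuTile's internal search strategy.

```python
import time


def pick_fastest(candidates, run, measure=None):
    """Return the candidate with the lowest measured cost."""

    def wall_time(value):
        start = time.perf_counter()
        run(value)
        return time.perf_counter() - start

    measure = measure or wall_time  # default: time one call of run(value)
    return min(candidates, key=measure)


# Deterministic stand-in cost model: pretend 4 is the sweet spot.
best = pick_fastest([1, 2, 4, 8, 16], run=lambda n: None,
                    measure=lambda n: abs(n - 4))
print(best)
```

In a real measurement you would typically warm up first (to exclude JIT compilation) and average several runs per candidate.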
Step 6. Repeat on B300
Repeat Steps 1-3 on B300 hardware. The same code runs without modification - cuTile JIT compiles for sm_103 automatically.
See the Platform Comparison tab for detailed scaling results.
FMHA Implementation
FMHA Implementation Guide
Note
This is a guide to understanding FMHA implementation in cuTile, not a complete reference. For comprehensive documentation, see the cuTile Python Documentation.
Attention Basics
Attention allows a neural network to focus on relevant parts of the input. In transformers (GPT, LLaMA, Qwen), each position computes how much to attend to every other position using three vectors:
- Query (Q): "What am I looking for?"
- Key (K): "What do I contain?"
- Value (V): "Here is my content"
Attention(Q, K, V) = softmax(Q × K^T / √d) × V
Shapes:
Q, K, V = [batch, heads, seq_len, head_dim]
Q × K^T = [batch, heads, seq_len, seq_len] # Attention scores
Output = [batch, heads, seq_len, head_dim]
For autoregressive models, causal masking ensures each token only attends to previous tokens by setting future scores to -infinity before softmax.
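To make the formula and the causal mask concrete, here is a naive single-head reference written in pure Python on nested lists. It is a teaching sketch only (no batching, no tiling) and is unrelated to the cuTile kernel's implementation.

```python
import math


def attention(q, k, v, causal=True):
    """Naive single-head attention: softmax(q k^T / sqrt(d)) v."""
    d = len(q[0])
    out = []
    for i, q_row in enumerate(q):
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d)
                  for k_row in k]
        if causal:
            # Future positions (j > i) get -inf, so exp() turns them into 0.
            scores = [s if j <= i else -math.inf for j, s in enumerate(scores)]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v_row[c] for w, v_row in zip(weights, v)) / total
                    for c in range(len(v[0]))])
    return out


q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = attention(q, k, v, causal=True)
print(out[0])  # the first token can only see itself, so this equals v[0]
```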
Flash Attention Algorithm
Standard attention materializes a [seq_len × seq_len] score matrix per head (e.g., 2 GB per head in float16 for seq_len=32768). Flash Attention avoids this by processing in tiles with online softmax:
m = -infinity      # Running maximum
l = 0              # Running sum of exp(x - m)
acc = 0            # Running weighted sum of values
FOR each K,V tile:
    scores = Q_tile @ K_tile.T * scale
    m_new = max(m, max(scores))
    correction = exp(m - m_new)
    l = l * correction + sum(exp(scores - m_new))
    acc = acc * correction + exp(scores - m_new) @ V_tile
    m = m_new
output = acc / l
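The recurrence above can be checked on a single row of scores in pure Python. The sketch below processes scores in tiles of two and compares the result with an ordinary softmax-weighted sum; the scalar values and the tile size are arbitrary illustration choices.

```python
import math


def online_softmax_weighted_sum(scores, values, tile=2):
    """Process (score, value) pairs tile by tile with a running max and sum."""
    m, l, acc = -math.inf, 0.0, 0.0
    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]
        m_new = max(m, max(s_tile))
        correction = math.exp(m - m_new)  # rescales earlier partial results
        p = [math.exp(s - m_new) for s in s_tile]
        l = l * correction + sum(p)
        acc = acc * correction + sum(pi * vi for pi, vi in zip(p, v_tile))
        m = m_new
    return acc / l


def reference(scores, values):
    """Softmax-weighted sum computed in one pass over all scores."""
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    return sum(wi * vi for wi, vi in zip(w, values)) / sum(w)


scores = [0.5, 2.0, -1.0, 3.5, 0.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
tiled = online_softmax_weighted_sum(scores, values)
full = reference(scores, values)
print(abs(tiled - full) < 1e-12)
```

On the first tile the running maximum is -infinity, so the correction factor is exp(-infinity) = 0 and the empty accumulators are simply overwritten.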
cuTile Pseudocode → Actual Mapping
| Concept | Pseudocode | cuTile |
|---|---|---|
| Define kernel | KERNEL fmha(...) | @ct.kernel() |
| Get block ID | block_x = BLOCK_ID_X | bid_x = ct.bid(0) |
| Create indices | range(0, N) | ct.arange(N, dtype=ct.int32) |
| Create constant tile | tile = zeros(M, N) | ct.full((M, N), 0.0, dtype) |
| Load from memory | tile = LOAD(ptr, shape) | ct.load(tensor, index, shape) |
| Store to memory | STORE(ptr, tile) | ct.store(tensor, index, tile) |
| Matrix multiply | C = A @ B + C | ct.mma(A, B, C) |
| Reduction | max_val = MAX(tile, axis) | ct.max(tile, axis, keepdims) |
Kernel Pseudocode
KERNEL fmha(Q, K, V, Out, scale, TILE_M, TILE_N):
    tile_row = BLOCK_ID_X
    batch_head = BLOCK_ID_Y
    batch = batch_head // num_heads
    head = batch_head % num_heads
    m_i = full(TILE_M, -infinity)
    l_i = full(TILE_M, 0)
    acc = zeros(TILE_M, head_dim)
    q = LOAD(Q[batch, head, tile_row*TILE_M : (tile_row+1)*TILE_M, :])
    FOR j = 0 to num_k_tiles:
        k = LOAD(K[batch, head, j*TILE_N : (j+1)*TILE_N, :])
        v = LOAD(V[batch, head, j*TILE_N : (j+1)*TILE_N, :])
        scores = MMA(q, transpose(k)) * scale
        IF causal AND in_mask_region:
            scores = WHERE(valid_mask, scores, -infinity)
        m_new = max(m_i, row_max(scores))
        correction = exp(m_i - m_new)
        p = exp(scores - m_new)
        l_i = l_i * correction + row_sum(p)
        acc = acc * correction + MMA(p, v)
        m_i = m_new
    out = acc / l_i
    STORE(Out[batch, head, tile_row*TILE_M : (tile_row+1)*TILE_M, :], out)
cuTile Implementation
import cuda.tile as ct
import math

ConstInt = ct.Constant[int]
ConstBool = ct.Constant[bool]

@ct.kernel()
def fmha_kernel(Q, K, V, Out, qk_scale: float, TILE_D: ConstInt, H: ConstInt,
                TILE_M: ConstInt, TILE_N: ConstInt, CAUSAL: ConstBool):
    bid_x, bid_y = ct.bid(0), ct.bid(1)
    batch_idx, head_idx = bid_y // H, bid_y % H
    offs_m = (bid_x * TILE_M + ct.arange(TILE_M, dtype=ct.int32))[:, None]
    offs_n_tile = ct.arange(TILE_N, dtype=ct.int32)[None, :]
    m_i = ct.full((TILE_M, 1), -math.inf, dtype=ct.float32)
    l_i = ct.full((TILE_M, 1), 0.0, dtype=ct.float32)
    acc = ct.full((TILE_M, TILE_D), 0.0, dtype=ct.float32)
    q = ct.load(Q, index=(batch_idx, head_idx, bid_x, 0),
                shape=(1, 1, TILE_M, TILE_D)).reshape((TILE_M, TILE_D))
    k_seqlen = K.shape[2]
    if CAUSAL:
        Tc = ct.cdiv(min((bid_x + 1) * TILE_M, k_seqlen), TILE_N)
        mask_start = (bid_x * TILE_M) // TILE_N
    else:
        Tc = ct.cdiv(k_seqlen, TILE_N)
        mask_start = k_seqlen // TILE_N
    for j in range(0, Tc):
        k_tile = ct.load(K, index=(batch_idx, head_idx, j, 0),
                         shape=(1, 1, TILE_N, TILE_D)).reshape((TILE_N, TILE_D))
        k_t = ct.permute(k_tile, (1, 0))
        qk = ct.mma(q, k_t, ct.full((TILE_M, TILE_N), 0.0, dtype=ct.float32))
        qk = qk * qk_scale
        if CAUSAL and j >= mask_start:
            offs_n = j * TILE_N + offs_n_tile
            qk = ct.where(offs_m >= offs_n, qk,
                          ct.full((TILE_M, TILE_N), -math.inf, dtype=ct.float32))
        m_ij = ct.maximum(m_i, ct.max(qk, axis=-1, keepdims=True))
        qk = qk - m_ij
        p = ct.exp(qk)
        alpha = ct.exp(m_i - m_ij)
        l_i = l_i * alpha + ct.sum(p, axis=-1, keepdims=True)
        acc = acc * alpha
        v_tile = ct.load(V, index=(batch_idx, head_idx, j, 0),
                         shape=(1, 1, TILE_N, TILE_D)).reshape((TILE_N, TILE_D))
        acc = ct.mma(p.astype(Q.dtype), v_tile, acc)
        m_i = m_ij
    acc = (acc / l_i).reshape((1, 1, TILE_M, TILE_D)).astype(Out.dtype)
    ct.store(Out, index=(batch_idx, head_idx, bid_x, 0), tile=acc)
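The causal branch of the kernel limits how many K/V tiles each query tile visits and marks where masking starts. The plain-Python sketch below mirrors those two formulas so they can be checked by hand; cdiv and causal_tile_bounds are helper names introduced here for illustration.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division for positive integers."""
    return (a + b - 1) // b


def causal_tile_bounds(bid_x, tile_m, tile_n, k_seqlen):
    """Mirror the CAUSAL branch: (number of K/V tiles, first masked tile)."""
    tc = cdiv(min((bid_x + 1) * tile_m, k_seqlen), tile_n)
    mask_start = (bid_x * tile_m) // tile_n
    return tc, mask_start


# Query tile 3 covers rows 192..255 with 64x64 tiles.
print(causal_tile_bounds(bid_x=3, tile_m=64, tile_n=64, k_seqlen=1024))
```

With these numbers, tiles 0 to 2 lie entirely in the past and skip the mask, while the last visited tile straddles the diagonal and needs the offs_m >= offs_n comparison.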
Launching the Kernel
def run_fmha(q, k, v, sm_scale, is_causal=True):
    import torch
    TILE_M, TILE_N = 64, 64  # Platform-specific (see below)
    batch, num_heads, seq_len, head_dim = q.shape
    out = torch.empty_like(q)
    grid = (math.ceil(seq_len / TILE_M), batch * num_heads, 1)
    ct.launch(
        torch.cuda.current_stream(), grid, fmha_kernel,
        (q, k, v, out, sm_scale, head_dim, num_heads, TILE_M, TILE_N, is_causal)
    )
    return out
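The grid expression in run_fmha assigns one block per query tile along the first axis and one per (batch, head) pair along the second. The standalone sketch below evaluates it for an example shape; batch 4 and 32 heads are borrowed from the benchmark title earlier in this playbook purely as an illustration.

```python
import math


def launch_grid(batch, num_heads, seq_len, tile_m):
    """Grid shape used by run_fmha: (query tiles, batch * heads, 1)."""
    return (math.ceil(seq_len / tile_m), batch * num_heads, 1)


grid = launch_grid(batch=4, num_heads=32, seq_len=2048, tile_m=64)
print(grid, "->", grid[0] * grid[1] * grid[2], "blocks")
```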
Optimizations
exp2 + flush_to_zero
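exp2(x) = 2^x is faster than exp(x) on GPU because it maps directly to a hardware instruction. To keep results unchanged, multiply the softmax scale by 1/ln(2), since exp(x) = exp2(x / ln 2).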
## Convert natural-exp scale to base-2 so we can use the faster ct.exp2 intrinsic.
## exp(x) == exp2(x / log(2)) == exp2(x * INV_LOG_2).
INV_LOG_2 = 1.0 / math.log(2) # ≈ 1.4427
qk_scale_log2 = qk_scale * INV_LOG_2 # Pre-multiply the softmax scale once
## ... in loop:
## Fuse the running-max update with the scale multiplication.
m_ij = ct.maximum(m_i, ct.max(qk, axis=-1, keepdims=True) * qk_scale_log2)
## Subtract the running max for numerical stability (online softmax).
qk = qk * qk_scale_log2 - m_ij
## flush_to_zero=True: flush denormals to 0 -> avoids slow denormal handling on GPU.
p = ct.exp2(qk, flush_to_zero=True)
alpha = ct.exp2(m_i - m_ij, flush_to_zero=True) # Correction factor for previous acc/l_i
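The identity behind this optimization is easy to confirm on the CPU with the standard library. The snippet below is only a numerical sanity check of exp(x) = 2^(x / ln 2); it does not use cuTile.

```python
import math

INV_LOG_2 = 1.0 / math.log(2)


def exp_via_exp2(x: float) -> float:
    """Compute e**x as 2**(x * 1/ln 2)."""
    return 2.0 ** (x * INV_LOG_2)


for x in (-3.0, -0.5, 0.0, 1.25):
    print(x, math.exp(x), exp_via_exp2(x))
```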
Load Order Transpose
Load K already transposed by passing the order parameter to ct.load, which avoids an explicit permute.
## order=(0,1,3,2) swaps the last two axes during the load,
## producing K^T directly in registers -- no extra ct.permute() needed.
## shape is expressed in the transposed layout: (1, 1, TILE_D, TILE_N).
k_t = ct.load(K, index=(..., 0, j), shape=(1,1,TILE_D,TILE_N),
order=(0,1,3,2)).reshape((TILE_D, TILE_N))
Latency Hints
Prefetch data to overlap memory loads with computation. See the Performance Tuning docs for the full list of load/store hints (e.g. allow_tma, latency).
## latency=N tells the compiler to issue this load N loop iterations in
## advance of its use, so the memory transfer overlaps with the MMA work
## from earlier iterations. Larger latency = deeper software pipeline but
## more register pressure.
k_t = ct.load(K, ..., latency=2) # Prefetch K 2 iterations ahead
v_tile = ct.load(V, ..., latency=4) # Prefetch V 4 iterations ahead (used later in the loop)
Occupancy
Allow multiple thread blocks per SM to hide memory latency. See the Execution Model docs for details on how occupancy interacts with registers and shared memory.
## occupancy=N is a hint to the compiler to target N concurrent CTAs per SM.
## Higher occupancy -> more warps available to hide memory latency,
## but constrains the per-CTA register/SMEM budget.
@ct.kernel(occupancy=2)  # 2 thread blocks (CTAs) co-resident per SM
def fmha_optimized(...):
    ...
Approximate Division
Use fast approximate division for final normalization.
from cuda.tile import RoundingMode as RMd
## RMd.APPROX -> hardware approximate reciprocal/divide (MUFU), much faster
## than IEEE-compliant division. Safe here because it's the final softmax
## normalization step where a small ULP error is acceptable.
## flush_to_zero=True flushes denormals to 0 to avoid the slow path.
acc = ct.truediv(acc, l_i, flush_to_zero=True, rounding_mode=RMd.APPROX)
Platform Configuration
The same kernel code works on all platforms; only configuration parameters change. Use ct.ByTarget to select values per architecture, or ct.autotune to search candidate values automatically.
| Platform | TILE_M | TILE_N | Occupancy | Rationale |
|---|---|---|---|---|
| DGX Spark (sm_121) | 64 | 64 | 2 | Smaller tiles, higher occupancy for 48 SMs |
| B300 (sm_103) | 256 | 128 | 1 | Large tiles maximize HBM3e throughput |
| B300 alternate | 128 | 128 | 2 | Higher occupancy, balanced parallelism |
import cuda.tile as ct

@ct.kernel(
    # TILE_M / TILE_N: rows/cols of the Q and K/V tiles processed per CTA.
    # Larger tiles -> more arithmetic intensity; smaller tiles -> higher occupancy.
    # occupancy: target concurrent CTAs per SM (latency hiding vs. register pressure).
    occupancy=ct.ByTarget({
        "sm_121": 2,     # DGX Spark (48 SMs): 2 CTAs/SM for latency hiding
        "sm_103": 1,     # B300: larger tiles already saturate the SM
        "default": 1,    # Conservative fallback for other architectures
    }),
    opt_level=3          # Maximum compiler optimization level
)
def fmha_kernel(...):
    ...
Performance Results
Note: PyTorch SDPA is used for correctness verification only, not performance comparison.
DGX Spark (sm_121) — Seq 2048
| Step | Optimization | Latency (ms) | TFLOPS |
|---|---|---|---|
| 1 | Basic cuTile | 2.19 | 62.8 |
| 2 | + exp2 | 2.07 | 66.5 |
| 3 | + Load Order | 2.07 | 66.3 |
| 4 | + Latency Hints | 2.07 | 66.5 |
| 5 | + Occupancy=2 | 1.73 | 79.5 |
| 6 | + Approx Div (Final) | 1.69 | 81.1 |
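The TFLOPS column can be approximately reproduced from the latency with the usual attention FLOP count: two matrix multiplies (QK^T and PV) at 2 FLOPs per multiply-add, halved for causal masking. The shape below (batch 4, 32 heads, head_dim 128, causal) is an assumption taken from the benchmark title earlier in this playbook; it is not stated for this table.

```python
def attention_tflops(batch, heads, seq_len, head_dim, latency_ms, causal=True):
    """Forward-pass TFLOPS for attention given a measured latency."""
    flops = 4 * batch * heads * seq_len * seq_len * head_dim
    if causal:
        flops //= 2  # roughly half of the score matrix is masked out
    return flops / (latency_ms * 1e-3) / 1e12


# Assumed shape: batch 4, 32 heads, head_dim 128, causal, seq_len 2048.
basic = attention_tflops(4, 32, 2048, 128, latency_ms=2.19)
final = attention_tflops(4, 32, 2048, 128, latency_ms=1.69)
print(f"basic {basic:.1f} TFLOPS, final {final:.1f} TFLOPS")
```

Small differences from the table are expected because the latencies shown are rounded to two decimals.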
B300 (sm_103) — Various Seq Lengths
| Seq Len | Latency (ms) | TFLOPS | vs Spark |
|---|---|---|---|
| 1024 | 0.074 | 465 | 5.7x |
| 2048 | 0.178 | 770 | 9.5x |
| 4096 | 0.550 | 999 | 15.1x |
| 8192 | 1.897 | 1159 | 14.6x |
| 16384 | 7.014 | 1254 | 14.2x |
Common Issues
| Issue | Solution |
|---|---|
| Shape mismatch in ct.mma | Ensure A is (M,K), B is (K,N), C is (M,N) |
| dtype errors | Use .astype() before mma; accumulator should be float32 |
| Incorrect results with causal | Check mask_start calculation and offs_m >= offs_n logic |
| Low performance | Try different TILE_M/N, check occupancy, verify latency hints |
Companion Scripts
The following scripts are included in this playbook and can be run on DGX Spark or B300:
- assets/fmha_optimization_tutorial.py - Step-by-step optimization tutorial. Builds the FMHA kernel from basic to fully optimized, matching the progression in this guide.
- assets/fmha_scaling_analysis.py - Scaling analysis across sequence lengths. Benchmarks each optimization level and generates performance data.
## Run the optimization tutorial (DGX Spark)
python assets/fmha_optimization_tutorial.py --correctness-check
## Run the scaling analysis
python assets/fmha_scaling_analysis.py --iterations 100
References
- cuTile Python Documentation
- Tile IR Specification
- TileGym (pre-optimized kernels)
- NVIDIA Blog: Tuning Flash Attention for Peak Performance in CUDA Tile
- Flash Attention Paper
Platform Comparison
DGX Spark vs B300 Performance Comparison
This page summarizes performance scaling between DGX Spark (GB10) and B300 for both kernel benchmarks and end-to-end LLM inference.
Kernel Benchmark Scaling
Use the ratios below as a reference for how kernel performance scales from DGX Spark (GB10) to B300.
| Kernel | Metric | B300 / GB10 |
|---|---|---|
| FMHA (causal, 8192) | TFLOPS | 13.7x |
| FMHA (non-causal, 8192) | TFLOPS | 15.1x |
| MatMul (8192) | TFLOPS | 18.9x |
| BMM (batch8, 4096) | TFLOPS | 19.4x |
| Group GEMM (4096) | TFLOPS | 23.9x |
| RMSNorm (4096) | GB/s | 33.1x |
| RoPE (16384) | GB/s | 22.8x |
Key Observations:
- Compute-heavy kernels typically scale 14-24x from GB10 to B300
- Memory-bound kernels can scale 20-33x due to HBM bandwidth advantage
Qwen2-7B Performance
End-to-End Throughput
| Configuration | DGX Spark | B300 | Platform Speedup |
|---|---|---|---|
| cuTile | 18.52 tok/s | 257.33 tok/s | 13.9x |
CUDA Kernel Time
| Configuration | DGX Spark | B300 | Platform Speedup |
|---|---|---|---|
| cuTile | 43,080 ms | 2,954 ms | 14.6x |
cuTile Kernel Breakdown
DGX Spark (GB10):
| Kernel | CUDA Time (ms) | Calls |
|---|---|---|
| fmha_kernel | 4,185.9 | 28 |
| swiglu_forward_kernel | 2,459.8 | 1,400 |
| attention_decode_kernel_grouped | 2,271.8 | 1,372 |
| rms_norm_kernel_static_persistent | 634.7 | 57 |
| rope_kernel | 355.6 | 1,400 |
B300:
| Kernel | CUDA Time (ms) | Speedup vs Spark |
|---|---|---|
| fmha_kernel | 337.9 | 12.4x |
| swiglu_forward_kernel | 226.3 | 10.9x |
| attention_decode_kernel_grouped | 111.0 | 20.5x |
| rms_norm_kernel_static_persistent | 29.7 | 21.4x |
| rope_kernel | 16.7 | 21.3x |
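The per-kernel speedups are the ratio of the DGX Spark time to the B300 time for the same kernel. The snippet below recomputes them from the two breakdown tables as a worked check.

```python
spark_ms = {
    "fmha_kernel": 4185.9,
    "swiglu_forward_kernel": 2459.8,
    "attention_decode_kernel_grouped": 2271.8,
    "rms_norm_kernel_static_persistent": 634.7,
    "rope_kernel": 355.6,
}
b300_ms = {
    "fmha_kernel": 337.9,
    "swiglu_forward_kernel": 226.3,
    "attention_decode_kernel_grouped": 111.0,
    "rms_norm_kernel_static_persistent": 29.7,
    "rope_kernel": 16.7,
}

# Speedup = time on Spark / time on B300, rounded as in the table.
speedup = {name: round(spark_ms[name] / b300_ms[name], 1) for name in spark_ms}
for name, ratio in speedup.items():
    print(f"{name}: {ratio}x")
```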
Same code, different architectures - cuTile JIT compiles for sm_121 (Spark) and sm_103 (B300)
Platform Specifications
| Specification | DGX Spark (GB10) | B300 |
|---|---|---|
| Compute Capability | sm_121 (12.1) | sm_103 (10.3) |
| SMs | 48 | 132 |
| Memory | 128 GB LPDDR5x | 192 GB HBM3e |
| Memory Bandwidth | 273 GB/s | 8 TB/s |
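The figures in this table give a rough yardstick for the scaling ratios reported above: memory-bound kernels would be expected to track the bandwidth ratio. The calculation below uses only the table's values and is a back-of-the-envelope estimate, not a performance model.

```python
spark = {"sms": 48, "bandwidth_gb_s": 273}
b300 = {"sms": 132, "bandwidth_gb_s": 8000}  # 8 TB/s expressed in GB/s

bandwidth_ratio = b300["bandwidth_gb_s"] / spark["bandwidth_gb_s"]
sm_ratio = b300["sms"] / spark["sms"]
print(f"bandwidth {bandwidth_ratio:.1f}x, SM count {sm_ratio:.2f}x")
```

The measured RMSNorm (33.1x) and RoPE (22.8x) ratios fall on either side of the bandwidth ratio, which fits their classification as memory-bound.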
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| docker: permission denied | User not in docker group | sudo usermod -aG docker $USER && newgrp docker |
| 401 Client Error: Unauthorized | Missing HuggingFace token | export HF_TOKEN=<your_token> |
| ModuleNotFoundError: tilegym | TileGym not installed | cd TileGym && pip install . |
| RuntimeError: CUDA out of memory | Model too large | Reduce batch size or use smaller model |
| Killed during model load | Out of system memory | Clear cache: sync; echo 3 > /proc/sys/vm/drop_caches |
| Slow first run | JIT compilation | Normal - cuTile compiles kernels on first run |
| FileNotFoundError: input_prompt_small.txt | Missing input file | Run from modeling/transformers directory |
| torch.cuda.OutOfMemoryError | Insufficient GPU memory | Reduce --batch_size parameter |
| ImportError: cuda.tile | Missing Tile IR | Install: apt-get install cuda-tile-ir-13-1 |
| Benchmark hangs | GPU busy or locked | Check nvidia-smi for other processes |
Note
DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. With many applications still updating to take advantage of UMA, you may encounter memory issues even when within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
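To see whether flushing the cache helped, you can compare the MemAvailable value from /proc/meminfo before and after. The parser below is a small hypothetical helper; it is shown on sample text so it runs anywhere, and on Linux you would pass open("/proc/meminfo").read() instead.

```python
def mem_available_gib(meminfo_text: str) -> float:
    """Extract MemAvailable (reported in kB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) / 2**20  # kB -> GiB
    raise ValueError("MemAvailable not found")


# Illustrative sample text, not a real measurement.
sample = "MemTotal:       134217728 kB\nMemAvailable:    67108864 kB\n"
print(f"{mem_available_gib(sample):.1f} GiB available")
```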
Tip
First run of cuTile kernels includes JIT compilation overhead. Subsequent runs will be faster as compiled kernels are cached.
For the latest known issues, please review the DGX Spark User Guide.