mirror of
https://github.com/NVIDIA/dgx-spark-playbooks.git
synced 2026-06-18 04:22:21 +00:00
860 lines
31 KiB
Markdown
860 lines
31 KiB
Markdown
|
|
# cuTile Kernels
|
|||
|
|
|
|||
|
|
> Run cuTile kernel benchmarks, FMHA implementation, and LLM inference on DGX Spark and B300
|
|||
|
|
|
|||
|
|
|
|||
|
|
## Table of Contents
|
|||
|
|
|
|||
|
|
- [Overview](#overview)
|
|||
|
|
- [Kernel Benchmarks](#kernel-benchmarks)
|
|||
|
|
- [End-to-End Inference](#end-to-end-inference)
|
|||
|
|
- [FMHA Implementation](#fmha-implementation)
|
|||
|
|
- [Attention Basics](#attention-basics)
|
|||
|
|
- [Flash Attention Algorithm](#flash-attention-algorithm)
|
|||
|
|
- [cuTile Pseudocode → Actual Mapping](#cutile-pseudocode-actual-mapping)
|
|||
|
|
- [Kernel Pseudocode](#kernel-pseudocode)
|
|||
|
|
- [cuTile Implementation](#cutile-implementation)
|
|||
|
|
- [Launching the Kernel](#launching-the-kernel)
|
|||
|
|
- [Optimizations](#optimizations)
|
|||
|
|
- [Platform Configuration](#platform-configuration)
|
|||
|
|
- [Performance Results](#performance-results)
|
|||
|
|
- [Common Issues](#common-issues)
|
|||
|
|
- [Companion Scripts](#companion-scripts)
|
|||
|
|
- [References](#references)
|
|||
|
|
- [Platform Comparison](#platform-comparison)
|
|||
|
|
- [End-to-End Throughput](#end-to-end-throughput)
|
|||
|
|
- [CUDA Kernel Time](#cuda-kernel-time)
|
|||
|
|
- [cuTile Kernel Breakdown](#cutile-kernel-breakdown)
|
|||
|
|
- [Troubleshooting](#troubleshooting)
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## Overview
|
|||
|
|
|
|||
|
|
## Basic idea
|
|||
|
|
|
|||
|
|
[TileGym](https://github.com/NVIDIA/TileGym) is NVIDIA's benchmark suite and integration framework for cuTile kernels - high-performance GPU kernels written using the cuTile Python DSL. cuTile compiles to Tile IR, enabling developers to write efficient kernels without low-level CUDA programming.
|
|||
|
|
|
|||
|
|
This playbook covers three workflows:
|
|||
|
|
1. **[Kernel Benchmarks](kernel-benchmarks)** - Run standalone cuTile kernel benchmarks (FMHA, MatMul, RMSNorm, etc.)
|
|||
|
|
2. **[End-to-End Inference](e2e-inference)** - Run LLM inference with cuTile-optimized kernels via monkey-patching
|
|||
|
|
3. **[FMHA Implementation](fmha)** - Step-by-step tutorial building a Flash Multi-Head Attention kernel from pseudocode to optimized cuTile, with companion scripts to run and benchmark
|
|||
|
|
|
|||
|
|
The same cuTile code runs on both DGX Spark (sm_121) and B300 (sm_103) - cuTile JIT compiles to the appropriate GPU architecture automatically.
|
|||
|
|
|
|||
|
|
## What you'll accomplish
|
|||
|
|
|
|||
|
|
- Run the TileGym benchmark suite on DGX Spark
|
|||
|
|
- Run Qwen2-7B or DeepSeek-V2-Lite inference with cuTile-optimized kernels
|
|||
|
|
- Observe performance scaling between DGX Spark and B300
|
|||
|
|
- Build an FMHA kernel step-by-step from pseudocode to optimized cuTile implementation
|
|||
|
|
|
|||
|
|
## What to know before starting
|
|||
|
|
|
|||
|
|
- Basic familiarity with Docker and command-line tools
|
|||
|
|
- Understanding of GPU compute concepts (TFLOPS, memory bandwidth)
|
|||
|
|
- No CUDA programming experience required
|
|||
|
|
- HuggingFace account with access token (for LLM inference)
|
|||
|
|
|
|||
|
|
## Prerequisites
|
|||
|
|
|
|||
|
|
**Hardware Requirements:**
|
|||
|
|
- DGX Spark with Ubuntu 24.04 or B300 cloud instance
|
|||
|
|
- Minimum 16GB GPU memory for LLM inference
|
|||
|
|
- At least 50GB available storage space for model downloads
|
|||
|
|
|
|||
|
|
**Software Requirements:**
|
|||
|
|
- Docker installed and configured: `docker ps`
|
|||
|
|
- CUDA Toolkit 13.x with Tile IR support
|
|||
|
|
- HuggingFace token for model access (LLM inference only)
|
|||
|
|
- Network access for pulling containers and downloading models
|
|||
|
|
|
|||
|
|
Verify Docker is available:
|
|||
|
|
```bash
|
|||
|
|
docker ps
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
If you get a permission error:
|
|||
|
|
```bash
|
|||
|
|
sudo usermod -aG docker $USER
|
|||
|
|
newgrp docker
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
## Kernel support matrix
|
|||
|
|
|
|||
|
|
| Kernel | Category | Data Types | Description |
|
|||
|
|
|--------|----------|------------|-------------|
|
|||
|
|
| **FMHA** | Attention | float16, float8 | Flash Multi-Head Attention |
|
|||
|
|
| **MLA** | Attention | bfloat16, float8 | Multi-head Latent Attention |
|
|||
|
|
| **MLA Decoding** | Attention | float16, float8 | MLA for decode phase |
|
|||
|
|
| **MatMul** | Matrix Ops | float16, float8 | Matrix multiplication |
|
|||
|
|
| **BMM** | Matrix Ops | float16 | Batched matrix multiplication |
|
|||
|
|
| **Group GEMM** | Matrix Ops | float16, float8 | Grouped GEMM for MoE |
|
|||
|
|
| **RMSNorm** | Normalization | float16, bfloat16 | Root mean square normalization |
|
|||
|
|
| **RoPE** | Positional | float16 | Rotary position embedding |
|
|||
|
|
| **SiLU** | Activation | float16, float32 | SiLU activation with multiply |
|
|||
|
|
| **SwiGLU** | Activation | float16, float32 | SwiGLU fused operation |
|
|||
|
|
| **Softmax** | Activation | float16 | Softmax normalization |
|
|||
|
|
| **Dropout** | Regularization | float16, float32 | Dropout forward |
|
|||
|
|
|
|||
|
|
## Model support for LLM inference
|
|||
|
|
|
|||
|
|
| Model | Supported Kernels | Batch Size | Output Tokens | Notes |
|
|||
|
|
|-------|-------------------|------------|---------------|-------|
|
|||
|
|
| **Qwen2-7B** | RoPE, RMSNorm, SwiGLU, FMHA | 16 | 50 | Standard transformer |
|
|||
|
|
| **DeepSeek-V2-Lite** | RoPE, RMSNorm, SiLU, MLA, MoE | 1 | 100 | MLA attention, MoE layers |
|
|||
|
|
|
|||
|
|
## Ancillary files
|
|||
|
|
|
|||
|
|
All required assets can be found in the [TileGym repository](https://github.com/NVIDIA/TileGym).
|
|||
|
|
|
|||
|
|
- `tests/benchmark/run_all.sh` - Run all kernel benchmarks
|
|||
|
|
- `modeling/transformers/bench_qwen.sh` - Qwen2-7B benchmark script
|
|||
|
|
- `modeling/transformers/bench_deepseek.sh` - DeepSeek-V2-Lite benchmark script
|
|||
|
|
- `modeling/transformers/infer.py` - Main inference script with TileGym integration
|
|||
|
|
- [`assets/fmha_optimization_tutorial.py`](assets/fmha_optimization_tutorial.py) - FMHA step-by-step optimization tutorial
|
|||
|
|
- [`assets/fmha_scaling_analysis.py`](assets/fmha_scaling_analysis.py) - FMHA scaling analysis across sequence lengths
|
|||
|
|
|
|||
|
|
## Time & risk
|
|||
|
|
|
|||
|
|
* **Estimated time:** 30-45 minutes (including model download for LLM inference)
|
|||
|
|
* **Risk level:** Low
|
|||
|
|
* Large downloads may fail due to network issues
|
|||
|
|
* First run includes JIT compilation overhead
|
|||
|
|
* **Rollback:** Remove Docker container to undo all changes
|
|||
|
|
* **Last Updated:** February 2026
|
|||
|
|
* First Publication
|
|||
|
|
|
|||
|
|
## Kernel Benchmarks
|
|||
|
|
|
|||
|
|
## Step 1. Pull CUDA NGC container with CTK 13.x
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
docker pull nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
Launch an interactive session with GPU access:
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
docker run --gpus all -it --rm \
|
|||
|
|
-v ~/TileGym:/workspace/TileGym \
|
|||
|
|
nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04 \
|
|||
|
|
/bin/bash
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
> [!NOTE]
|
|||
|
|
> The `-v` flag mounts a local directory to persist the TileGym repository. The `--rm` flag automatically removes the container when you exit; omit it if you want to keep the container for later use.
|
|||
|
|
|
|||
|
|
Or if running outside a container, install Tile IR directly:
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
## Requires root privileges - run with sudo or as root
|
|||
|
|
sudo apt-get install cuda-tile-ir-13-1 cuda-compiler-13-1
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
## Step 2. Clone TileGym repository
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
git clone https://github.com/NVIDIA/TileGym
|
|||
|
|
cd TileGym
|
|||
|
|
pip install .
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
## Step 3. Run benchmark suite
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
cd tests/benchmark/
|
|||
|
|
bash run_all.sh
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
> [!NOTE]
|
|||
|
|
> The benchmark runs sequentially to ensure accurate timing results. This may take 10-15 minutes to complete all kernels.
|
|||
|
|
|
|||
|
|
## Step 4. View results
|
|||
|
|
|
|||
|
|
Results show cuTile performance for each kernel and sequence length.
|
|||
|
|
|
|||
|
|
Expected output should look like:
|
|||
|
|
|
|||
|
|
```text
|
|||
|
|
==========================================
|
|||
|
|
Running bench_fused_attention.py...
|
|||
|
|
==========================================
|
|||
|
|
fused-attention-batch4-head32-d128-fwd-causal=True-float16-TFLOPS:
|
|||
|
|
N_CTX CuTile
|
|||
|
|
0 1024.0 58.188262
|
|||
|
|
1 2048.0 80.906892
|
|||
|
|
2 4096.0 86.189532
|
|||
|
|
3 8192.0 88.891086
|
|||
|
|
4 16384.0 89.491869
|
|||
|
|
✓ PASSED: bench_fused_attention.py
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
## Step 5. Run individual benchmarks
|
|||
|
|
|
|||
|
|
To run specific kernel benchmarks:
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
## Flash Multi-Head Attention
|
|||
|
|
python bench_fused_attention.py
|
|||
|
|
|
|||
|
|
## Matrix Multiplication
|
|||
|
|
python bench_matrix_multiplication.py
|
|||
|
|
|
|||
|
|
## RMSNorm
|
|||
|
|
python bench_rmsnorm.py
|
|||
|
|
|
|||
|
|
## RoPE
|
|||
|
|
python bench_rope.py
|
|||
|
|
|
|||
|
|
## SwiGLU
|
|||
|
|
python bench_swiglu.py
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
## Step 6. Clean up
|
|||
|
|
|
|||
|
|
Exit the container:
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
exit
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
Remove this workflow's containers (if you ran without `--rm`):
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
## Preferred: remove only containers from this workflow's image
|
|||
|
|
docker rm $(docker ps -a --filter ancestor=nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04 --format '{{.ID}}')
|
|||
|
|
|
|||
|
|
## Alternative: prune all stopped containers (will prompt for confirmation)
|
|||
|
|
## docker container prune
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
Remove the image (optional):
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
docker rmi nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
## Step 7. Repeat on B300
|
|||
|
|
|
|||
|
|
Repeat Steps 1-6 on B300 hardware to observe scaling. See the **Platform Comparison** tab for expected scaling results.
|
|||
|
|
|
|||
|
|
## End-to-End Inference
|
|||
|
|
|
|||
|
|
## Step 1. Set up environment
|
|||
|
|
|
|||
|
|
If you haven't already, pull the CUDA container and clone TileGym (see **Kernel Benchmarks** tab for details).
|
|||
|
|
|
|||
|
|
First, clone TileGym on the host:
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
mkdir -p ~/TileGym
|
|||
|
|
git clone https://github.com/NVIDIA/TileGym ~/TileGym
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
Then launch the container with the repository mounted:
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
docker run --gpus all -it --rm \
|
|||
|
|
-v ~/TileGym:/workspace/TileGym \
|
|||
|
|
-v ~/.cache/huggingface:/root/.cache/huggingface \
|
|||
|
|
nvcr.io/nvidia/cuda:13.1-devel-ubuntu24.04 \
|
|||
|
|
/bin/bash
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
> [!NOTE]
|
|||
|
|
> The `-v ~/.cache/huggingface:/root/.cache/huggingface` mounts your HuggingFace cache to avoid re-downloading models.
|
|||
|
|
|
|||
|
|
Install TileGym inside the container:
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
cd /workspace/TileGym
|
|||
|
|
pip install .
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
Set your HuggingFace token for accessing gated models:
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
export HF_TOKEN=<your_huggingface_token>
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
> [!WARNING]
|
|||
|
|
> You need a HuggingFace account and access token. Get one at https://huggingface.co/settings/tokens
|
|||
|
|
|
|||
|
|
## Step 2. Run inference benchmark
|
|||
|
|
|
|||
|
|
Navigate to the transformers benchmark directory:
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
cd modeling/transformers
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**Option A: Run Qwen2-7B benchmark**
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
./bench_qwen.sh
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
Configuration: Model `Qwen/Qwen2-7B`, Batch size 16, Output length 50 tokens.
|
|||
|
|
|
|||
|
|
**Option B: Run DeepSeek-V2-Lite benchmark**
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
./bench_deepseek.sh
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
Configuration: Model `deepseek-ai/DeepSeek-V2-Lite-Chat`, Batch size 1, Output length 100 tokens.
|
|||
|
|
|
|||
|
|
Both scripts run two configurations:
|
|||
|
|
1. **PyTorch baseline** - Standard HuggingFace inference
|
|||
|
|
2. **TileGym cuTile** - With cuTile kernel replacements
|
|||
|
|
|
|||
|
|
## Step 3. View results
|
|||
|
|
|
|||
|
|
**Sample DGX Spark (GB10) Results for Qwen2-7B:**
|
|||
|
|
|
|||
|
|
```text
|
|||
|
|
========================================
|
|||
|
|
Benchmark Results
|
|||
|
|
========================================
|
|||
|
|
Qwen2-7B_naive_bfloat16 | 15.66 tokens/s | 51.10s | 51151.0ms CUDA
|
|||
|
|
Qwen2-7B_cutile_attn | 18.52 tokens/s | 43.20s | 43079.7ms CUDA
|
|||
|
|
========================================
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**cuTile Kernel Breakdown (DGX Spark - Qwen2):**
|
|||
|
|
|
|||
|
|
| Kernel | CUDA Time (ms) | Calls |
|
|||
|
|
|--------|----------------|-------|
|
|||
|
|
| `fmha_kernel` | 4185.9 | 28 |
|
|||
|
|
| `swiglu_forward_kernel` | 2459.8 | 1400 |
|
|||
|
|
| `attention_decode_kernel_grouped` | 2271.8 | 1372 |
|
|||
|
|
| `rms_norm_kernel_static_persistent` | 634.7 | 57 |
|
|||
|
|
| `rope_kernel` | 355.6 | 1400 |
|
|||
|
|
|
|||
|
|
## Step 4. How TileGym monkey-patching works
|
|||
|
|
|
|||
|
|
TileGym replaces PyTorch model operations with cuTile kernels. The snippet below is taken from TileGym's [`src/tilegym/transformers/monkey_patch.py`](https://github.com/NVIDIA/TileGym/blob/main/src/tilegym/transformers/monkey_patch.py) and invoked from [`modeling/transformers/infer.py`](https://github.com/NVIDIA/TileGym/blob/main/modeling/transformers/infer.py):
|
|||
|
|
|
|||
|
|
```python
|
|||
|
|
from tilegym.transformers import apply_tilegym_kernel_to_qwen2
|
|||
|
|
|
|||
|
|
apply_tilegym_kernel_to_qwen2(
|
|||
|
|
rope=True, # Replace RoPE with cuTile kernel
|
|||
|
|
rms_norm=True, # Replace RMSNorm with cuTile kernel
|
|||
|
|
swiglu=True, # Replace SwiGLU with cuTile kernel
|
|||
|
|
attn=True, # Replace attention with cuTile FMHA
|
|||
|
|
use_cutile=True # Use cuTile backend (vs Triton)
|
|||
|
|
)
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**Patched Kernels for Qwen2:**
|
|||
|
|
|
|||
|
|
| Kernel | PyTorch Operation | cuTile Replacement |
|
|||
|
|
|--------|-------------------|-------------------|
|
|||
|
|
| `rms_norm_kernel_static_persistent` | `nn.RMSNorm` | Persistent RMSNorm |
|
|||
|
|
| `rope_kernel` | Rotary position embedding | Fused RoPE |
|
|||
|
|
| `fmha_kernel` | `F.scaled_dot_product_attention` | Flash Attention |
|
|||
|
|
| `swiglu_forward_kernel` | SiLU + Mul | Fused SwiGLU |
|
|||
|
|
| `attention_decode_kernel_grouped` | Decode attention | Grouped decode |
|
|||
|
|
|
|||
|
|
**Patched Kernels for DeepSeek-V2:** (see [`src/tilegym/transformers/monkey_patch.py`](https://github.com/NVIDIA/TileGym/blob/main/src/tilegym/transformers/monkey_patch.py))
|
|||
|
|
|
|||
|
|
```python
|
|||
|
|
from tilegym.transformers import apply_tilegym_kernel_to_deepseek_v2
|
|||
|
|
|
|||
|
|
apply_tilegym_kernel_to_deepseek_v2(
|
|||
|
|
rope=True, # Replace RoPE with cuTile kernel
|
|||
|
|
rms_norm=True, # Replace RMSNorm with cuTile kernel
|
|||
|
|
swiglu=True, # Replace SiLU+Mul with cuTile kernel
|
|||
|
|
attn=True, # Replace MLA attention with cuTile
|
|||
|
|
moe=True, # Replace MoE routing with cuTile
|
|||
|
|
use_cutile=True
|
|||
|
|
)
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
| Kernel | PyTorch Operation | cuTile Replacement |
|
|||
|
|
|--------|-------------------|-------------------|
|
|||
|
|
| `prefill_mla` | MLA prefill attention | Multi-head Latent Attention |
|
|||
|
|
| `_mla_decoding_split_kv` | MLA decode attention | Split-KV decoding |
|
|||
|
|
| `fused_moe_kernel` | MoE expert routing | Fused MoE |
|
|||
|
|
| `group_gemm_kernel` | Expert FFN | Grouped GEMM |
|
|||
|
|
|
|||
|
|
## Step 5. Platform-specific tuning (Advanced)
|
|||
|
|
|
|||
|
|
cuTile exposes two complementary performance-tuning mechanisms:
|
|||
|
|
|
|||
|
|
- **[`ct.ByTarget`](https://docs.nvidia.com/cuda/cutile-python/performance.html)** - Select different kernel launch parameters per GPU architecture (`sm_<major><minor>`). The compiler picks the value matching the current target at JIT time; if no entry matches, the `default` value is used. See the [Performance Tuning](https://docs.nvidia.com/cuda/cutile-python/performance.html) and [Execution Model](https://docs.nvidia.com/cuda/cutile-python/execution.html) pages.
|
|||
|
|
- **`num_ctas`** - Number of Cooperative Thread Arrays (thread blocks) launched per kernel invocation. Tune to the number of SMs on the target GPU.
|
|||
|
|
- **`occupancy`** - Hint for the number of concurrent CTAs the compiler should target per SM. Higher occupancy hides memory latency but increases register/shared-memory pressure. See the [Execution Model](https://docs.nvidia.com/cuda/cutile-python/execution.html) documentation.
|
|||
|
|
- **[`ct.autotune`](https://docs.nvidia.com/cuda/cutile-python/performance.html)** - Search a list of candidate values at runtime and pick the fastest configuration. Results are reported via [`cuda.tile.tune.TuningResult`](https://docs.nvidia.com/cuda/cutile-python/generated/cuda.tile.tune.TuningResult.html) / [`Measurement`](https://docs.nvidia.com/cuda/cutile-python/generated/cuda.tile.tune.Measurement.html).
|
|||
|
|
|
|||
|
|
```python
|
|||
|
|
import cuda.tile as ct
|
|||
|
|
|
|||
|
|
@ct.kernel(
|
|||
|
|
# # num_ctas: how many thread blocks to launch.
|
|||
|
|
# # Use ByTarget to pick an arch-specific value at JIT time.
|
|||
|
|
num_ctas=ct.ByTarget({
|
|||
|
|
"sm_103": 8, # B300 - more SMs, launch more CTAs
|
|||
|
|
"sm_121": 4, # DGX Spark - fewer SMs (48), use fewer CTAs
|
|||
|
|
"default": 1, # Fallback for any other GPU architecture
|
|||
|
|
}),
|
|||
|
|
# # occupancy: hint for concurrent CTAs per SM (latency hiding vs. register pressure).
|
|||
|
|
occupancy=ct.ByTarget({
|
|||
|
|
"sm_103": 16, # B300 - high occupancy, plenty of registers/SMEM
|
|||
|
|
"sm_121": 12, # DGX Spark - moderate occupancy
|
|||
|
|
"default": 8, # Conservative fallback
|
|||
|
|
}),
|
|||
|
|
opt_level=3 # Maximum compiler optimization level
|
|||
|
|
)
|
|||
|
|
def optimized_kernel(A, B, C):
|
|||
|
|
# # Same kernel code works on all platforms;
|
|||
|
|
# # ByTarget swaps in the arch-specific launch params automatically.
|
|||
|
|
...
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
For automatic tuning, use [`ct.autotune`](https://docs.nvidia.com/cuda/cutile-python/performance.html) to search over candidate values and pick the fastest configuration at runtime:
|
|||
|
|
|
|||
|
|
```python
|
|||
|
|
@ct.kernel(
|
|||
|
|
# # autotune: benchmark each value and pick the fastest.
|
|||
|
|
num_ctas=ct.autotune([1, 2, 4, 8, 16]),
|
|||
|
|
occupancy=ct.autotune([8, 12, 16, 24]),
|
|||
|
|
opt_level=3
|
|||
|
|
)
|
|||
|
|
def autotuned_kernel(A, B, C):
|
|||
|
|
...
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
## Step 6. Repeat on B300
|
|||
|
|
|
|||
|
|
Repeat Steps 1-3 on B300 hardware. The **same code runs without modification** - cuTile JIT compiles for sm_103 automatically.
|
|||
|
|
|
|||
|
|
See the **Platform Comparison** tab for detailed scaling results.
|
|||
|
|
|
|||
|
|
## FMHA Implementation
|
|||
|
|
|
|||
|
|
## FMHA Implementation Guide
|
|||
|
|
|
|||
|
|
> [!NOTE]
|
|||
|
|
> This is a guide to understanding FMHA implementation in cuTile, not a complete reference. For comprehensive documentation, see the [cuTile Python Documentation](https://docs.nvidia.com/cuda/cutile-python/).
|
|||
|
|
|
|||
|
|
### Attention Basics
|
|||
|
|
|
|||
|
|
Attention allows a neural network to focus on relevant parts of the input. In transformers (GPT, LLaMA, Qwen), each position computes how much to attend to every other position using three vectors:
|
|||
|
|
|
|||
|
|
- **Query (Q)**: "What am I looking for?"
|
|||
|
|
- **Key (K)**: "What do I contain?"
|
|||
|
|
- **Value (V)**: "Here is my content"
|
|||
|
|
|
|||
|
|
```text
|
|||
|
|
Attention(Q, K, V) = softmax(Q × K^T / √d) × V
|
|||
|
|
|
|||
|
|
Shapes:
|
|||
|
|
Q, K, V = [batch, heads, seq_len, head_dim]
|
|||
|
|
Q × K^T = [batch, heads, seq_len, seq_len] # Attention scores
|
|||
|
|
Output = [batch, heads, seq_len, head_dim]
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
For autoregressive models, **causal masking** ensures each token only attends to previous tokens by setting future scores to -infinity before softmax.
|
|||
|
|
|
|||
|
|
### Flash Attention Algorithm
|
|||
|
|
|
|||
|
|
Standard attention materializes a [seq_len × seq_len] matrix (e.g., 2 GB for seq_len=32768). Flash Attention avoids this by processing in tiles with **online softmax**:
|
|||
|
|
|
|||
|
|
```text
|
|||
|
|
m = -infinity # Running maximum
|
|||
|
|
l = 0 # Running sum of exp(x - m)
|
|||
|
|
acc = 0 # Running weighted sum of values
|
|||
|
|
|
|||
|
|
FOR each K,V tile:
|
|||
|
|
scores = Q_tile @ K_tile.T * scale
|
|||
|
|
m_new = max(m, max(scores))
|
|||
|
|
correction = exp(m - m_new)
|
|||
|
|
l = l * correction + sum(exp(scores - m_new))
|
|||
|
|
acc = acc * correction + exp(scores - m_new) @ V_tile
|
|||
|
|
m = m_new
|
|||
|
|
|
|||
|
|
output = acc / l
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
### cuTile Pseudocode → Actual Mapping
|
|||
|
|
|
|||
|
|
| Concept | Pseudocode | cuTile |
|
|||
|
|
|---|---|---|
|
|||
|
|
| Define kernel | `KERNEL fmha(...)` | `@ct.kernel()` |
|
|||
|
|
| Get block ID | `block_x = BLOCK_ID_X` | `bid_x = ct.bid(0)` |
|
|||
|
|
| Create indices | `range(0, N)` | `ct.arange(N, dtype=ct.int32)` |
|
|||
|
|
| Create constant tile | `tile = zeros(M, N)` | `ct.full((M, N), 0.0, dtype)` |
|
|||
|
|
| Load from memory | `tile = LOAD(ptr, shape)` | `ct.load(tensor, index, shape)` |
|
|||
|
|
| Store to memory | `STORE(ptr, tile)` | `ct.store(tensor, index, tile)` |
|
|||
|
|
| Matrix multiply | `C = A @ B + C` | `ct.mma(A, B, C)` |
|
|||
|
|
| Reduction | `max_val = MAX(tile, axis)` | `ct.max(tile, axis, keepdims)` |
|
|||
|
|
|
|||
|
|
### Kernel Pseudocode
|
|||
|
|
|
|||
|
|
```text
|
|||
|
|
KERNEL fmha(Q, K, V, Out, scale, TILE_M, TILE_N):
|
|||
|
|
tile_row = BLOCK_ID_X
|
|||
|
|
batch_head = BLOCK_ID_Y
|
|||
|
|
batch = batch_head // num_heads
|
|||
|
|
head = batch_head % num_heads
|
|||
|
|
|
|||
|
|
m_i = full(TILE_M, -infinity)
|
|||
|
|
l_i = full(TILE_M, 0)
|
|||
|
|
acc = zeros(TILE_M, head_dim)
|
|||
|
|
|
|||
|
|
q = LOAD(Q[batch, head, tile_row*TILE_M : (tile_row+1)*TILE_M, :])
|
|||
|
|
|
|||
|
|
FOR j = 0 to num_k_tiles:
|
|||
|
|
k = LOAD(K[batch, head, j*TILE_N : (j+1)*TILE_N, :])
|
|||
|
|
v = LOAD(V[batch, head, j*TILE_N : (j+1)*TILE_N, :])
|
|||
|
|
scores = MMA(q, transpose(k)) * scale
|
|||
|
|
IF causal AND in_mask_region:
|
|||
|
|
scores = WHERE(valid_mask, scores, -infinity)
|
|||
|
|
m_new = max(m_i, row_max(scores))
|
|||
|
|
correction = exp(m_i - m_new)
|
|||
|
|
p = exp(scores - m_new)
|
|||
|
|
l_i = l_i * correction + row_sum(p)
|
|||
|
|
acc = acc * correction + MMA(p, v)
|
|||
|
|
m_i = m_new
|
|||
|
|
|
|||
|
|
out = acc / l_i
|
|||
|
|
STORE(Out[batch, head, tile_row*TILE_M :, :], out)
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
### cuTile Implementation
|
|||
|
|
|
|||
|
|
```python
|
|||
|
|
import cuda.tile as ct
|
|||
|
|
import math
|
|||
|
|
ConstInt = ct.Constant[int]
|
|||
|
|
ConstBool = ct.Constant[bool]
|
|||
|
|
|
|||
|
|
@ct.kernel()
|
|||
|
|
def fmha_kernel(Q, K, V, Out, qk_scale: float, TILE_D: ConstInt, H: ConstInt,
|
|||
|
|
TILE_M: ConstInt, TILE_N: ConstInt, CAUSAL: ConstBool):
|
|||
|
|
bid_x, bid_y = ct.bid(0), ct.bid(1)
|
|||
|
|
batch_idx, head_idx = bid_y // H, bid_y % H
|
|||
|
|
|
|||
|
|
offs_m = (bid_x * TILE_M + ct.arange(TILE_M, dtype=ct.int32))[:, None]
|
|||
|
|
offs_n_tile = ct.arange(TILE_N, dtype=ct.int32)[None, :]
|
|||
|
|
|
|||
|
|
m_i = ct.full((TILE_M, 1), -math.inf, dtype=ct.float32)
|
|||
|
|
l_i = ct.full((TILE_M, 1), 0.0, dtype=ct.float32)
|
|||
|
|
acc = ct.full((TILE_M, TILE_D), 0.0, dtype=ct.float32)
|
|||
|
|
|
|||
|
|
q = ct.load(Q, index=(batch_idx, head_idx, bid_x, 0),
|
|||
|
|
shape=(1, 1, TILE_M, TILE_D)).reshape((TILE_M, TILE_D))
|
|||
|
|
|
|||
|
|
k_seqlen = K.shape[2]
|
|||
|
|
if CAUSAL:
|
|||
|
|
Tc = ct.cdiv(min((bid_x + 1) * TILE_M, k_seqlen), TILE_N)
|
|||
|
|
mask_start = (bid_x * TILE_M) // TILE_N
|
|||
|
|
else:
|
|||
|
|
Tc = ct.cdiv(k_seqlen, TILE_N)
|
|||
|
|
mask_start = k_seqlen // TILE_N
|
|||
|
|
|
|||
|
|
for j in range(0, Tc):
|
|||
|
|
k_tile = ct.load(K, index=(batch_idx, head_idx, j, 0),
|
|||
|
|
shape=(1, 1, TILE_N, TILE_D)).reshape((TILE_N, TILE_D))
|
|||
|
|
k_t = ct.permute(k_tile, (1, 0))
|
|||
|
|
|
|||
|
|
qk = ct.mma(q, k_t, ct.full((TILE_M, TILE_N), 0.0, dtype=ct.float32))
|
|||
|
|
qk = qk * qk_scale
|
|||
|
|
|
|||
|
|
if CAUSAL and j >= mask_start:
|
|||
|
|
offs_n = j * TILE_N + offs_n_tile
|
|||
|
|
qk = ct.where(offs_m >= offs_n, qk,
|
|||
|
|
ct.full((TILE_M, TILE_N), -math.inf, dtype=ct.float32))
|
|||
|
|
|
|||
|
|
m_ij = ct.maximum(m_i, ct.max(qk, axis=-1, keepdims=True))
|
|||
|
|
qk = qk - m_ij
|
|||
|
|
p = ct.exp(qk)
|
|||
|
|
alpha = ct.exp(m_i - m_ij)
|
|||
|
|
l_i = l_i * alpha + ct.sum(p, axis=-1, keepdims=True)
|
|||
|
|
acc = acc * alpha
|
|||
|
|
|
|||
|
|
v_tile = ct.load(V, index=(batch_idx, head_idx, j, 0),
|
|||
|
|
shape=(1, 1, TILE_N, TILE_D)).reshape((TILE_N, TILE_D))
|
|||
|
|
acc = ct.mma(p.astype(Q.dtype), v_tile, acc)
|
|||
|
|
m_i = m_ij
|
|||
|
|
|
|||
|
|
acc = (acc / l_i).reshape((1, 1, TILE_M, TILE_D)).astype(Out.dtype)
|
|||
|
|
ct.store(Out, index=(batch_idx, head_idx, bid_x, 0), tile=acc)
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
### Launching the Kernel
|
|||
|
|
|
|||
|
|
```python
|
|||
|
|
def run_fmha(q, k, v, sm_scale, is_causal=True):
|
|||
|
|
import torch
|
|||
|
|
TILE_M, TILE_N = 64, 64 # Platform-specific (see below)
|
|||
|
|
batch, num_heads, seq_len, head_dim = q.shape
|
|||
|
|
out = torch.empty_like(q)
|
|||
|
|
grid = (math.ceil(seq_len / TILE_M), batch * num_heads, 1)
|
|||
|
|
ct.launch(
|
|||
|
|
torch.cuda.current_stream(), grid, fmha_kernel,
|
|||
|
|
(q, k, v, out, sm_scale, head_dim, num_heads, TILE_M, TILE_N, is_causal)
|
|||
|
|
)
|
|||
|
|
return out
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
### Optimizations
|
|||
|
|
|
|||
|
|
#### exp2 + flush_to_zero
|
|||
|
|
|
|||
|
|
`exp2(x) = 2^x` is faster than `exp(x)` on GPU. Requires scale adjustment by `1/log(2)`.
|
|||
|
|
|
|||
|
|
```python
|
|||
|
|
## Convert natural-exp scale to base-2 so we can use the faster ct.exp2 intrinsic.
|
|||
|
|
## exp(x) == exp2(x / log(2)) == exp2(x * INV_LOG_2).
|
|||
|
|
INV_LOG_2 = 1.0 / math.log(2) # ≈ 1.4427
|
|||
|
|
qk_scale_log2 = qk_scale * INV_LOG_2 # Pre-multiply the softmax scale once
|
|||
|
|
|
|||
|
|
## ... in loop:
|
|||
|
|
## Fuse the running-max update with the scale multiplication.
|
|||
|
|
m_ij = ct.max(qk, axis=-1, keepdims=True) * qk_scale_log2
|
|||
|
|
## Subtract the running max for numerical stability (online softmax).
|
|||
|
|
qk = qk * qk_scale_log2 - m_ij
|
|||
|
|
## flush_to_zero=True: flush denormals to 0 -> avoids slow denormal handling on GPU.
|
|||
|
|
p = ct.exp2(qk, flush_to_zero=True)
|
|||
|
|
alpha = ct.exp2(m_i - m_ij, flush_to_zero=True) # Correction factor for previous acc/l_i
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
#### Load Order Transpose
|
|||
|
|
|
|||
|
|
Load K already transposed using `order` parameter, avoiding explicit permute.
|
|||
|
|
|
|||
|
|
```python
|
|||
|
|
## order=(0,1,3,2) swaps the last two axes during the load,
|
|||
|
|
## producing K^T directly in registers -- no extra ct.permute() needed.
|
|||
|
|
## shape is expressed in the transposed layout: (1, 1, TILE_D, TILE_N).
|
|||
|
|
k_t = ct.load(K, index=(..., 0, j), shape=(1,1,TILE_D,TILE_N),
|
|||
|
|
order=(0,1,3,2)).reshape((TILE_D, TILE_N))
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
#### Latency Hints
|
|||
|
|
|
|||
|
|
Prefetch data to overlap memory loads with computation. See the [Performance Tuning docs](https://docs.nvidia.com/cuda/cutile-python/performance.html) for the full list of load/store hints (e.g. `allow_tma`, `latency`).
|
|||
|
|
|
|||
|
|
```python
|
|||
|
|
## latency=N tells the compiler to issue this load N loop iterations in
|
|||
|
|
## advance of its use, so the memory transfer overlaps with the MMA work
|
|||
|
|
## from earlier iterations. Larger latency = deeper software pipeline but
|
|||
|
|
## more register pressure.
|
|||
|
|
k_t = ct.load(K, ..., latency=2) # Prefetch K 2 iterations ahead
|
|||
|
|
v_tile = ct.load(V, ..., latency=4) # Prefetch V 4 iterations ahead (used later in the loop)
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
#### Occupancy
|
|||
|
|
|
|||
|
|
Allow multiple thread blocks per SM to hide memory latency. See the [Execution Model docs](https://docs.nvidia.com/cuda/cutile-python/execution.html) for details on how `occupancy` interacts with registers and shared memory.
|
|||
|
|
|
|||
|
|
```python
|
|||
|
|
## occupancy=N is a hint to the compiler to target N concurrent CTAs per SM.
|
|||
|
|
## Higher occupancy -> more warps available to hide memory latency,
|
|||
|
|
## but constrains the per-CTA register/SMEM budget.
|
|||
|
|
@ct.kernel(occupancy=2) # 2 thread blocks (CTAs) co-resident per SM
|
|||
|
|
def fmha_optimized(...):
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
#### Approximate Division
|
|||
|
|
|
|||
|
|
Use fast approximate division for final normalization.
|
|||
|
|
|
|||
|
|
```python
|
|||
|
|
from cuda.tile import RoundingMode as RMd
|
|||
|
|
## RMd.APPROX -> hardware approximate reciprocal/divide (MUFU), much faster
|
|||
|
|
## than IEEE-compliant division. Safe here because it's the final softmax
|
|||
|
|
## normalization step where a small ULP error is acceptable.
|
|||
|
|
## flush_to_zero=True flushes denormals to 0 to avoid the slow path.
|
|||
|
|
acc = ct.truediv(acc, l_i, flush_to_zero=True, rounding_mode=RMd.APPROX)
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
### Platform Configuration
|
|||
|
|
|
|||
|
|
The same kernel code works on all platforms; only configuration parameters change. Use [`ct.ByTarget`](https://docs.nvidia.com/cuda/cutile-python/performance.html) to select values per architecture, or [`ct.autotune`](https://docs.nvidia.com/cuda/cutile-python/performance.html) to search candidate values automatically.
|
|||
|
|
|
|||
|
|
| Platform | TILE_M | TILE_N | Occupancy | Rationale |
|
|||
|
|
|---|---|---|---|---|
|
|||
|
|
| DGX Spark (sm_121) | 64 | 64 | 2 | Smaller tiles, higher occupancy for 48 SMs |
|
|||
|
|
| B300 (sm_103) | 256 | 128 | 1 | Large tiles maximize HBM3e throughput |
|
|||
|
|
| B300 alternate | 128 | 128 | 2 | Higher occupancy, balanced parallelism |
|
|||
|
|
|
|||
|
|
```python
|
|||
|
|
import cuda.tile as ct
|
|||
|
|
|
|||
|
|
@ct.kernel(
|
|||
|
|
# # TILE_M / TILE_N: rows/cols of the Q and K/V tiles processed per CTA.
|
|||
|
|
# # Larger tiles -> more arithmetic intensity; smaller tiles -> higher occupancy.
|
|||
|
|
# # occupancy: target concurrent CTAs per SM (latency hiding vs. register pressure).
|
|||
|
|
occupancy=ct.ByTarget({
|
|||
|
|
"sm_121": 2, # DGX Spark (48 SMs): 2 CTAs/SM for latency hiding
|
|||
|
|
"sm_100": 1, # B300: larger tiles already saturate the SM
|
|||
|
|
"default": 1, # Conservative fallback for other architectures
|
|||
|
|
}),
|
|||
|
|
opt_level=3 # Maximum compiler optimization level
|
|||
|
|
)
|
|||
|
|
def fmha_kernel(...):
|
|||
|
|
...
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
### Performance Results
|
|||
|
|
|
|||
|
|
> **Note:** PyTorch SDPA is used for correctness verification only, not performance comparison.
|
|||
|
|
|
|||
|
|
#### DGX Spark (sm_121) — Seq 2048
|
|||
|
|
|
|||
|
|
| Step | Optimization | Latency (ms) | TFLOPS |
|
|||
|
|
|---|---|---|---|
|
|||
|
|
| 1 | Basic cuTile | 2.19 | 62.8 |
|
|||
|
|
| 2 | + exp2 | 2.07 | 66.5 |
|
|||
|
|
| 3 | + Load Order | 2.07 | 66.3 |
|
|||
|
|
| 4 | + Latency Hints | 2.07 | 66.5 |
|
|||
|
|
| 5 | + Occupancy=2 | 1.73 | 79.5 |
|
|||
|
|
| 6 | + Approx Div (Final) | 1.69 | 81.1 |
|
|||
|
|
|
|||
|
|
#### B300 (sm_103) — Various Seq Lengths
|
|||
|
|
|
|||
|
|
| Seq Len | Latency (ms) | TFLOPS | vs Spark |
|
|||
|
|
|---|---|---|---|
|
|||
|
|
| 1024 | 0.074 | 465 | 5.7x |
|
|||
|
|
| 2048 | 0.178 | 770 | 9.5x |
|
|||
|
|
| 4096 | 0.550 | 999 | 15.1x |
|
|||
|
|
| 8192 | 1.897 | 1159 | 14.6x |
|
|||
|
|
| 16384 | 7.014 | 1254 | 14.2x |
|
|||
|
|
|
|||
|
|
### Common Issues
|
|||
|
|
|
|||
|
|
| Issue | Solution |
|
|||
|
|
|---|---|
|
|||
|
|
| Shape mismatch in ct.mma | Ensure A is (M,K), B is (K,N), C is (M,N) |
|
|||
|
|
| dtype errors | Use `.astype()` before mma; accumulator should be float32 |
|
|||
|
|
| Incorrect results with causal | Check mask_start calculation and `offs_m >= offs_n` logic |
|
|||
|
|
| Low performance | Try different TILE_M/N, check occupancy, verify latency hints |
|
|||
|
|
|
|||
|
|
### Companion Scripts
|
|||
|
|
|
|||
|
|
The following scripts are included in this playbook and can be run on DGX Spark or B300:
|
|||
|
|
|
|||
|
|
- **[`assets/fmha_optimization_tutorial.py`](assets/fmha_optimization_tutorial.py)** — Step-by-step optimization tutorial. Builds the FMHA kernel from basic to fully optimized, matching the progression in this guide.
|
|||
|
|
- **[`assets/fmha_scaling_analysis.py`](assets/fmha_scaling_analysis.py)** — Scaling analysis across sequence lengths. Benchmarks each optimization level and generates performance data.
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
## Run the optimization tutorial (DGX Spark)
|
|||
|
|
python assets/fmha_optimization_tutorial.py --correctness-check
|
|||
|
|
|
|||
|
|
## Run the scaling analysis
|
|||
|
|
python assets/fmha_scaling_analysis.py --iterations 100
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
### References
|
|||
|
|
|
|||
|
|
- [cuTile Python Documentation](https://docs.nvidia.com/cuda/cutile-python/)
|
|||
|
|
- [Tile IR Specification](https://docs.nvidia.com/cuda/tile-ir/)
|
|||
|
|
- [TileGym (pre-optimized kernels)](https://github.com/NVIDIA/TileGym)
|
|||
|
|
- [NVIDIA Blog: Tuning Flash Attention for Peak Performance in CUDA Tile](https://developer.nvidia.com/blog/tuning-flash-attention-for-peak-performance-in-nvidia-cuda-tile/)
|
|||
|
|
- [Flash Attention Paper](https://arxiv.org/abs/2205.14135)
|
|||
|
|
|
|||
|
|
## Platform Comparison
|
|||
|
|
|
|||
|
|
## DGX Spark vs B300 Performance Comparison
|
|||
|
|
|
|||
|
|
This page summarizes performance scaling between DGX Spark (GB10) and B300 for both kernel benchmarks and end-to-end LLM inference.
|
|||
|
|
|
|||
|
|
## Kernel Benchmark Scaling
|
|||
|
|
|
|||
|
|
Use the ratios below as a reference for how kernel performance scales from DGX Spark (GB10) to B300.
|
|||
|
|
|
|||
|
|
| Kernel | Metric | B300 / GB10 |
|
|||
|
|
|--------|--------|-------------|
|
|||
|
|
| FMHA (causal, 8192) | TFLOPS | 13.7x |
|
|||
|
|
| FMHA (non-causal, 8192) | TFLOPS | 15.1x |
|
|||
|
|
| MatMul (8192) | TFLOPS | 18.9x |
|
|||
|
|
| BMM (batch8, 4096) | TFLOPS | 19.4x |
|
|||
|
|
| Group GEMM (4096) | TFLOPS | 23.9x |
|
|||
|
|
| RMSNorm (4096) | GB/s | 33.1x |
|
|||
|
|
| RoPE (16384) | GB/s | 22.8x |
|
|||
|
|
|
|||
|
|
**Key Observations:**
|
|||
|
|
- Compute-heavy kernels typically scale 14-24x from GB10 to B300
|
|||
|
|
- Memory-bound kernels can scale 20-33x due to HBM bandwidth advantage
|
|||
|
|
|
|||
|
|
## Qwen2-7B Performance
|
|||
|
|
|
|||
|
|
### End-to-End Throughput
|
|||
|
|
|
|||
|
|
| Configuration | DGX Spark | B300 | Platform Speedup |
|
|||
|
|
|---------------|-----------|------|------------------|
|
|||
|
|
| **cuTile** | 18.52 tok/s | 257.33 tok/s | **13.9x** |
|
|||
|
|
|
|||
|
|
### CUDA Kernel Time
|
|||
|
|
|
|||
|
|
| Configuration | DGX Spark | B300 | Platform Speedup |
|
|||
|
|
|---------------|-----------|------|------------------|
|
|||
|
|
| **cuTile** | 43,080 ms | 2,954 ms | **14.6x** |
|
|||
|
|
|
|||
|
|
### cuTile Kernel Breakdown
|
|||
|
|
|
|||
|
|
**DGX Spark (GB10):**
|
|||
|
|
|
|||
|
|
| Kernel | CUDA Time (ms) | Calls |
|
|||
|
|
|--------|----------------|-------|
|
|||
|
|
| `fmha_kernel` | 4,185.9 | 28 |
|
|||
|
|
| `swiglu_forward_kernel` | 2,459.8 | 1,400 |
|
|||
|
|
| `attention_decode_kernel_grouped` | 2,271.8 | 1,372 |
|
|||
|
|
| `rms_norm_kernel_static_persistent` | 634.7 | 57 |
|
|||
|
|
| `rope_kernel` | 355.6 | 1,400 |
|
|||
|
|
|
|||
|
|
**B300:**
|
|||
|
|
|
|||
|
|
| Kernel | CUDA Time (ms) | Speedup vs Spark |
|
|||
|
|
|--------|----------------|------------------|
|
|||
|
|
| `fmha_kernel` | 337.9 | 12.4x |
|
|||
|
|
| `swiglu_forward_kernel` | 226.3 | 10.9x |
|
|||
|
|
| `attention_decode_kernel_grouped` | 111.0 | 20.5x |
|
|||
|
|
| `rms_norm_kernel_static_persistent` | 29.7 | 21.4x |
|
|||
|
|
| `rope_kernel` | 16.7 | 21.3x |
|
|||
|
|
|
|||
|
|
**Same code, different architectures** - cuTile JIT compiles for sm_121 (Spark) and sm_103 (B300)
|
|||
|
|
|
|||
|
|
## Platform Specifications
|
|||
|
|
|
|||
|
|
| Specification | DGX Spark (GB10) | B300 |
|
|||
|
|
|---------------|------------------|------|
|
|||
|
|
| Compute Capability | sm_121 (12.1) | sm_103 (10.3) |
|
|||
|
|
| SMs | 48 | 132 |
|
|||
|
|
| Memory | 128 GB LPDDR5x | 192 GB HBM3e |
|
|||
|
|
| Memory Bandwidth | 273 GB/s | 8 TB/s |
|
|||
|
|
|
|||
|
|
## Troubleshooting
|
|||
|
|
|
|||
|
|
| Symptom | Cause | Fix |
|
|||
|
|
|---------|-------|-----|
|
|||
|
|
| `docker: permission denied` | User not in docker group | `sudo usermod -aG docker $USER && newgrp docker` |
|
|||
|
|
| `401 Client Error: Unauthorized` | Missing HuggingFace token | `export HF_TOKEN=<your_token>` |
|
|||
|
|
| `ModuleNotFoundError: tilegym` | TileGym not installed | `cd TileGym && pip install .` |
|
|||
|
|
| `RuntimeError: CUDA out of memory` | Model too large | Reduce batch size or use smaller model |
|
|||
|
|
| `Killed` during model load | Out of system memory | Clear cache: `sync; echo 3 > /proc/sys/vm/drop_caches` |
|
|||
|
|
| Slow first run | JIT compilation | Normal - cuTile compiles kernels on first run |
|
|||
|
|
| `FileNotFoundError: input_prompt_small.txt` | Missing input file | Run from `modeling/transformers` directory |
|
|||
|
|
| `torch.cuda.OutOfMemoryError` | Insufficient GPU memory | Reduce `--batch_size` parameter |
|
|||
|
|
| `ImportError: cuda.tile` | Missing Tile IR | Install: `apt-get install cuda-tile-ir-13-1` |
|
|||
|
|
| Benchmark hangs | GPU busy or locked | Check `nvidia-smi` for other processes |
|
|||
|
|
|
|||
|
|
> [!NOTE]
|
|||
|
|
> DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU.
|
|||
|
|
> With many applications still updating to take advantage of UMA, you may encounter memory issues even when within
|
|||
|
|
> the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
> [!TIP]
|
|||
|
|
> First run of cuTile kernels includes JIT compilation overhead. Subsequent runs will be faster as compiled kernels are cached.
|
|||
|
|
|
|||
|
|
For the latest known issues, please review the [DGX Spark User Guide](https://docs.nvidia.com/dgx/dgx-spark/known-issues.html).
|