mirror of
https://github.com/NVIDIA/dgx-spark-playbooks.git
synced 2026-06-18 04:22:21 +00:00
267 lines
11 KiB
YAML
kind: Playbook
metadata:
name: station-mig
displayName: MIG on DGX Station
shortDescription: Enable and configure Multi-Instance GPU (MIG) on DGX Station with GB300 Ultra (B300 GPUs)

publisher: nvidia
description: |
# REPLACE THIS WITH YOUR MODEL CARD
https://gitlab-master.nvidia.com/api-catalog/examples/-/blob/main/modelcard-example-mixtral8x7b.md?ref_type=heads

labelsV2:
- gpuType:playbook:gpu_type_station
- DGX Station
- GB300
- MIG
- GPU Partitioning
- B300
- System Configuration

attributes:
- key: DURATION
value: 15 MIN

spec:
artifactName: station-mig
nvcfFunctionId: None
attributes:

showUnavailableBanner: false
apiDocsUrl: None
termsOfUse: |

cta:
text: MIG User Guide
url: https://docs.nvidia.com/datacenter/tesla/mig-user-guide/

tabs:
-
id: overview

label: Overview
content: |
# Basic idea

**Multi-Instance GPU (MIG)** lets you partition a single NVIDIA B300 GPU on your DGX Station (GB300 Ultra) into multiple smaller GPU instances. Each instance has dedicated memory and compute, so you can serve multiple workloads or users from one physical GPU without sharing memory. This playbook walks you through enabling MIG, creating a B300 MIG layout, and using the instances from bare-metal apps or containers.

MIG is controlled via `nvidia-smi`: you enable MIG mode, then create GPU and compute instances using B300 profile IDs (e.g. for `1g.34gb`, `2g.67gb`, `7g.269gb`). When you no longer need partitioning, you disable MIG to restore full-GPU mode and NVLink P2P.

# What you'll accomplish

You will have MIG enabled and configured on your DGX Station B300 GPUs and know how to use the instances.

- **Enable MIG** on all B300 GPUs or on a per-GPU basis.
- **Create a MIG layout** using B300 profile IDs (with a known-good example for multiple GPUs).
- **Verify** the layout with `nvidia-smi -L` and `nvidia-smi mig -lgi` / `-lci`.
- **Run workloads** by setting `CUDA_VISIBLE_DEVICES` to a MIG UUID or by using the container/Kubernetes flows from the MIG User Guide.
- **Disable MIG** when you need full-GPU mode and NVLink again.

# What to know before starting

- Basic Linux command line and use of `sudo`.
- Familiarity with `nvidia-smi` and GPU indices.
- Optional: understanding of `CUDA_VISIBLE_DEVICES` and containers if you plan to run workloads on MIG instances.

# Prerequisites

**Hardware:**

- NVIDIA DGX Station with GB300 Ultra Superchip (B300 GPUs).
- No additional storage requirement for MIG configuration itself.

**Software:**

- NVIDIA driver installed and working (verify by running `nvidia-smi`).
- Root or sudo access to run `nvidia-smi -mig 1`, `nvidia-smi -mig 0`, and `nvidia-smi mig -cgi ... -C`.
- For containers/K8s: nvidia-container-toolkit and MIG support as described in the [MIG User Guide](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/).

# Ancillary files

This playbook does not use repository assets; all steps use `nvidia-smi` and MIG commands on the DGX Station. For container and Kubernetes setup, use the official [MIG User Guide](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/) (Getting Started with MIG and Kubernetes sections).

# Time & risk

- **Estimated time:** About 15 minutes to enable MIG, create a layout, and verify. Layout design (which profiles per GPU) may take longer if you customize.
- **Risk level:** Low to Medium
  - Enabling or disabling MIG requires sudo and affects all workloads on that GPU.
  - Disabling MIG removes all MIG instances; ensure Fabric Manager is running on DGX/HGX B200/B300 so NVLink/NVSwitch re-initialize correctly.
- **Rollback:** Run `sudo nvidia-smi -mig 0` to disable MIG and return to a single full-GPU instance per B300.
- **Last Updated:** February 2025
  - First publication.

-
id: instructions

label: Instructions
content: |
# Step 1. Prerequisites and verify B300 GPUs

Ensure your DGX Station is running with B300 GPUs (GB300 Ultra) and that the NVIDIA driver and `nvidia-smi` are available. You need root or sudo to enable MIG and create instances.

```bash
nvidia-smi
nvidia-smi -L
```

The output should list one or more **NVIDIA B300** devices (e.g. `NVIDIA B300 SXM6 AC`). If you see B300 GPUs, you can proceed to enable MIG.

# Step 2. Enable MIG mode on the B300 GPUs

Enable MIG for all GPUs in the system or for a specific GPU. This must be done with elevated privileges.

**Enable MIG on all GPUs:**

```bash
sudo nvidia-smi -mig 1
```

**Or enable MIG on a single GPU (e.g. GPU 0 only):**

```bash
sudo nvidia-smi -i 0 -mig 1
```

Enabling MIG mode does not create any instances by itself; it only makes each B300 partitionable. You create GPU instances and compute instances from profiles in the next steps.

# Step 3. Verify MIG mode and inspect B300 profiles

Confirm that MIG mode is enabled:

```bash
nvidia-smi -q | grep -i mig
# or for a specific GPU:
nvidia-smi -i 0 -q | grep -i "MIG Mode"
```

The output should show MIG Mode: **Enabled**.

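For scripted checks, the same information is available in CSV form through `nvidia-smi --query-gpu`. The sketch below assumes your driver supports the `mig.mode.current` and `mig.mode.pending` query fields (check `nvidia-smi --help-query-gpu`); `gpus_without_mig` is a hypothetical helper name, not part of any NVIDIA tool. It prints the index of every GPU whose current MIG mode is not `Enabled`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read CSV lines of the form
#   "<index>, <current MIG mode>, <pending MIG mode>"
# on stdin and print the index of each GPU whose current mode is not "Enabled".
gpus_without_mig() {
  awk -F', *' '$2 != "Enabled" { print $1 }'
}

# Usage on a live system (no output means MIG is enabled on every GPU):
#   nvidia-smi --query-gpu=index,mig.mode.current,mig.mode.pending \
#              --format=csv,noheader | gpus_without_mig
```

If a GPU shows `Enabled` as its pending mode but not as its current mode, the change has probably not taken effect yet, and a GPU reset or reboot may be needed.
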
List the GPU Instance Profiles available on a B300 (e.g. GPU 0). These profile IDs are used when creating MIG instances:

```bash
nvidia-smi mig -lgip -i 0
```

On B300 you should see profiles such as:

- MIG 1g.34gb (ID 19)
- MIG 1g.34gb+me (ID 20)
- MIG 1g.67gb (ID 15)
- MIG 2g.67gb (ID 14)
- MIG 3g.135gb (ID 9)
- MIG 4g.135gb (ID 5)
- MIG 7g.269gb (ID 0)

Note the **IDs**; you will pass them to `-cgi` when creating the layout.

# Step 4. Create a MIG layout (example for B300)

Create GPU and compute instances using the profile IDs from Step 3. The basic pattern is:

```bash
sudo nvidia-smi mig -cgi <profile_id,profile_id,...> -C -i <gpu_index>
```

Example layout for a 6-GPU DGX Station (adjust GPU indices and counts to match your system). Each GPU can have any combination of profiles that fits within its capacity:

```bash
# GPU 0: 7 × 1g.34gb
sudo nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C -i 0

# GPU 1: 4 × 1g.67gb
sudo nvidia-smi mig -cgi 15,15,15,15 -C -i 1

# GPU 2: 3 × 2g.67gb
sudo nvidia-smi mig -cgi 14,14,14 -C -i 2

# GPU 3: 2 × 3g.135gb
sudo nvidia-smi mig -cgi 9,9 -C -i 3

# GPU 4: 1 × 4g.135gb
sudo nvidia-smi mig -cgi 5 -C -i 4

# GPU 5: 1 × 7g.269gb (full GPU as a single MIG instance)
sudo nvidia-smi mig -cgi 0 -C -i 5
```

You can choose any valid combination of profile IDs per GPU that fits within the B300’s capacity; the above is a known-good example. If your DGX Station has fewer than 6 GPUs, run only the `-i <N>` commands for GPUs that exist (e.g. 0 and 1 only).

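When the layout spans several GPUs, it can help to keep it in one table and preview the commands before running them. The following is a minimal sketch, not an NVIDIA-provided tool: `LAYOUT`, `apply_layout`, and the `DRY_RUN` switch are hypothetical names, and the two rows simply reuse the GPU 0 and GPU 1 examples from this step.

```bash
#!/usr/bin/env bash
# Hypothetical layout table: "<gpu_index> <comma-separated profile IDs>".
# Edit it to match your GPUs and the IDs reported by `nvidia-smi mig -lgip`.
LAYOUT='0 19,19,19,19,19,19,19
1 15,15,15,15'

# Read layout rows on stdin. With DRY_RUN=1 (the default) only print the
# commands; with DRY_RUN=0 execute them and stop at the first failure.
apply_layout() {
  local gpu profiles
  while read -r gpu profiles; do
    if [ -z "$gpu" ]; then
      continue
    fi
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "sudo nvidia-smi mig -cgi $profiles -C -i $gpu"
    else
      # </dev/null keeps sudo from consuming the remaining layout rows.
      sudo nvidia-smi mig -cgi "$profiles" -C -i "$gpu" </dev/null || return 1
    fi
  done
}

# Preview the commands for the table above.
printf '%s\n' "$LAYOUT" | apply_layout
```

Once the printed commands look right, re-run with `DRY_RUN=0` to apply them.
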
# Step 5. Verify MIG instances

Check the resulting MIG device layout:

```bash
nvidia-smi -L
```

You should see each physical **NVIDIA B300 SXM6 AC** followed by its MIG devices, for example:

```
GPU 0: NVIDIA B300 SXM6 AC (UUID: GPU-...)
  MIG 1g.34gb Device 0: (UUID: MIG-...)
  MIG 1g.34gb Device 1: (UUID: MIG-...)
...
GPU 1: NVIDIA B300 SXM6 AC (UUID: GPU-...)
  MIG 1g.67gb Device 0: (UUID: MIG-...)
...
```

To list GPU instances and compute instances:

```bash
nvidia-smi mig -lgi   # list GPU instances
nvidia-smi mig -lci   # list compute instances
```

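The MIG UUIDs in the `nvidia-smi -L` listing are what later steps need, so it is convenient to extract them as a plain list. This sketch assumes each MIG device line contains `(UUID: MIG-...)` as in the sample above; `mig_uuids` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `nvidia-smi -L` output on stdin and print one
# MIG UUID per line. Lines for physical GPUs (UUID: GPU-...) are skipped.
mig_uuids() {
  sed -n 's/.*(UUID: \(MIG-[^)]*\)).*/\1/p'
}

# Usage on a live system:
#   nvidia-smi -L | mig_uuids
```
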
# Step 6. Using the MIG devices

**Bare-metal CUDA applications:** set `CUDA_VISIBLE_DEVICES` to a MIG device UUID (from `nvidia-smi -L`):

```bash
export CUDA_VISIBLE_DEVICES=MIG-<uuid>
./your_app
```

**Containers and Kubernetes:** use the NVIDIA MIG User Guide “Getting Started with MIG” and the Kubernetes sections. They cover the nvidia-container-toolkit, device plugin, and nvidia-mig-manager workflows for exposing MIG instances to containers.

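To use several MIG instances at once on bare metal, one common pattern is to start one process per instance, each with its own `CUDA_VISIBLE_DEVICES`. The sketch below illustrates that pattern under the assumption that the workers are independent; `run_per_mig` is a hypothetical helper name, and `./your_app` stands in for your own binary.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read MIG UUIDs (one per line) on stdin and start the
# given command once per UUID, in the background, with CUDA_VISIBLE_DEVICES
# set to that UUID. Returns after all started processes have exited.
run_per_mig() {
  local uuid
  while read -r uuid; do
    if [ -z "$uuid" ]; then
      continue
    fi
    # </dev/null keeps the worker from consuming the remaining UUID list.
    CUDA_VISIBLE_DEVICES="$uuid" "$@" </dev/null &
  done
  wait
}

# Usage on a live system:
#   nvidia-smi -L | sed -n 's/.*(UUID: \(MIG-[^)]*\)).*/\1/p' | run_per_mig ./your_app
```

Each worker sees only its own MIG instance, so the application itself needs no MIG-specific code.
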
# Step 7. Disabling MIG and restoring full GPU

When you need full NVLink P2P and a single full-GPU instance again, disable MIG on all GPUs:

> [!WARNING]
> This removes all MIG instances and returns each B300 to a single full-GPU instance. Any workloads using MIG UUIDs will need to be reconfigured or restarted.

```bash
sudo nvidia-smi -mig 0
```

This resets the GPUs. On DGX/HGX B200/B300, ensure **Fabric Manager** is running so that NVLinks and NVSwitch fabric routing are re-initialized after MIG is disabled.

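If you prefer an explicit teardown, `nvidia-smi mig -dci` and `nvidia-smi mig -dgi` destroy compute instances and GPU instances before MIG mode is switched off. The sketch below only prints that sequence for review; `disable_mig_plan` is a hypothetical helper name, and the `nvidia-fabricmanager` service name in the comment is an assumption that may differ on your system.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print (not run) an explicit teardown sequence.
# Optional argument: a GPU index; without it the commands target all GPUs.
disable_mig_plan() {
  local gpu="${1:-}"
  local sel=""
  if [ -n "$gpu" ]; then
    sel=" -i $gpu"
  fi
  echo "sudo nvidia-smi mig -dci$sel"   # destroy compute instances first
  echo "sudo nvidia-smi mig -dgi$sel"   # then destroy GPU instances
  echo "sudo nvidia-smi$sel -mig 0"     # finally disable MIG mode
}

disable_mig_plan 0

# Assumed service name; adjust if your system uses a different unit:
#   systemctl is-active nvidia-fabricmanager
```

Stop any workloads that use the MIG instances before running the printed commands.
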
-
id: troubleshooting

label: Troubleshooting
content: |
| Symptom | Cause | Fix |
|--------|--------|-----|
| `nvidia-smi -mig 1` fails or "MIG mode not supported" | Driver too old or GPU not MIG-capable | Ensure you have a B300 (or other MIG-capable GPU) and a driver version that supports MIG on B300. Check `nvidia-smi -q` and [MIG User Guide](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/) for supported hardware/driver. |
| `nvidia-smi mig -cgi ... -C -i N` fails (e.g. "Invalid combination") | Profile combination exceeds GPU capacity or invalid IDs | Run `nvidia-smi mig -lgip -i N` and use only listed profile IDs. Ensure the sum of instance sizes does not exceed the B300’s capacity for that GPU. |
| MIG instances not visible after creation | Instances not created or wrong GPU index | Run `nvidia-smi -L` and `nvidia-smi mig -lgi` to confirm. Re-run the `-cgi` commands for the correct `-i <gpu_index>`. |
| App doesn’t see MIG device when using `CUDA_VISIBLE_DEVICES=MIG-<uuid>` | Wrong UUID or app not using `CUDA_VISIBLE_DEVICES` | Get UUIDs from `nvidia-smi -L`. Export `CUDA_VISIBLE_DEVICES=MIG-<uuid>` in the same shell before launching the app. |
| After `nvidia-smi -mig 0`, NVLink or fabric issues on DGX/HGX | Fabric Manager not re-initializing | On DGX/HGX B200/B300, ensure Fabric Manager is running after disabling MIG so NVLinks and NVSwitch fabric are re-initialized. |
| Permission denied when running `nvidia-smi -mig` or `nvidia-smi mig -cgi` | Need root for MIG operations | Use `sudo` for `nvidia-smi -mig 1/0` and `nvidia-smi mig -cgi ... -C`. |

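For the "Invalid combination" case, a quick arithmetic check on the compute slices can catch obviously oversized layouts. The sketch below assumes the leading `<N>g` in a profile name is its compute-slice count and that a B300 exposes 7 compute slices (as the `7g.269gb` full-GPU profile suggests); `compute_slices` is a hypothetical helper name. Passing this check is necessary but not sufficient, because memory slices and placement rules also constrain valid layouts.

```bash
#!/usr/bin/env bash
# Hypothetical helper: sum the leading "<N>g" of each profile name given as
# an argument, e.g. "3g.135gb" contributes 3 and "1g.34gb+me" contributes 1.
compute_slices() {
  local total=0 name n
  for name in "$@"; do
    n="${name%%g.*}"          # keep only the digits before the first "g."
    total=$(( total + n ))
  done
  echo "$total"
}

# Example: two 3g.135gb instances use 6 of the assumed 7 compute slices.
compute_slices 3g.135gb 3g.135gb
```

A total above 7 means the combination cannot fit; a total of 7 or less still has to be accepted by `nvidia-smi mig -cgi`.
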
resources:
- name: MIG User Guide (Getting Started with MIG)
url: https://docs.nvidia.com/datacenter/tesla/mig-user-guide/