dgx-spark-playbooks/nvidia/station-ai-skills/assets/.codex/prompts/mig-configure.md
2026-05-30 11:49:27 +00:00

2.9 KiB
Raw Permalink Blame History

MIG Configuration on DGX Station

Configure MIG (Multi-Instance GPU) partitions on the DGX Station GB300.

Steps

  1. Find the GB300 GPU index. Run:

    nvidia-smi --query-gpu=index,name --format=csv,noheader
    
  2. Check current MIG state:

    nvidia-smi -i <GB300_INDEX> -q | grep -i "MIG Mode"
    
  3. If MIG is already enabled, show current instances:

    nvidia-smi mig -lgi -i <GB300_INDEX>
    nvidia-smi mig -lci -i <GB300_INDEX>
    

    If the user wants to reconfigure, destroy existing instances first (step 6).

  4. If MIG is not enabled, enable it. All GPU processes must be stopped first:

    # Check for running GPU processes
    sudo fuser -v /dev/nvidia*
    
    # Enable MIG
    sudo nvidia-smi -i <GB300_INDEX> -mig 1
    
    # Verify
    nvidia-smi -i <GB300_INDEX> -q | grep -i "MIG Mode"
    
  5. Show available profiles and help the user choose a layout:

    nvidia-smi mig -lgip -i <GB300_INDEX>
    

    Common GB300 MIG profiles:

    Profile ID Memory Use case
    1g.35gb 19 ~35 GB Small models (7-8B), dev/test
    1g.35gb+me 20 ~35 GB Same + media extensions
    1g.70gb 15 ~70 GB Slightly larger inference
    2g.70gb 14 ~70 GB Medium models (14-30B)
    3g.139gb 9 ~139 GB Large models (70B quantized)
    4g.139gb 5 ~139 GB Large models, more compute
    7g.278gb 0 ~278 GB Full GPU as single instance

    Suggest layouts based on the user's workload. Examples:

    • Two models (70B + 8B): 3g.139gb + 2g.70gb + 2g.70gb → IDs 9,14,14
    • Many small models: 7 × 1g.35gb → IDs 19,19,19,19,19,19,19
    • One large model with isolation: 7g.278gb → ID 0

    Ask the user what models they want to run before suggesting a layout.

  6. Create (or recreate) instances:

    If reconfiguring, destroy existing instances first:

    sudo nvidia-smi mig -dci -i <GB300_INDEX>
    sudo nvidia-smi mig -dgi -i <GB300_INDEX>
    

    Then create the new layout:

    sudo nvidia-smi mig -cgi <PROFILE_IDS> -C -i <GB300_INDEX>
    
  7. Get the MIG device UUIDs:

    nvidia-smi -L
    

    Note the MIG-<uuid> entries — these are used to target specific MIG instances.

  8. Show the user how to use MIG devices:

    # Bare metal
    export CUDA_VISIBLE_DEVICES=MIG-<uuid>
    
    # Docker
    docker run --gpus '"device=MIG-<uuid>"' ...
    
  9. Report the final layout to the user with UUIDs and suggested docker commands for each instance.

Disabling MIG

If the user wants to return to full-GPU mode:

# Stop all workloads using MIG instances first
sudo nvidia-smi mig -dci -i <GB300_INDEX>
sudo nvidia-smi mig -dgi -i <GB300_INDEX>
sudo nvidia-smi -i <GB300_INDEX> -mig 0

# Ensure Fabric Manager is running for NVLink re-initialization
sudo systemctl start nvidia-fabricmanager