sarman/dgx-spark-playbooks

Fork 0

mirror of https://github.com/NVIDIA/dgx-spark-playbooks.git synced 2026-06-24 15:19:30 +00:00

GitLab CI 1d1a95b3cb chore: Regenerate all playbooks

2026-05-30 11:49:27 +00:00

3.4 KiB

Raw Blame History

name

description

metadata

dgx-diagnose

Diagnose common DGX Station GB300 issues — CUDA crashes, wrong-GPU targeting, vLLM/SGLang container bugs, MIG state problems, NVLink/Fabric Manager errors, X/Vulkan failures, HuggingFace auth, and port conflicts. Use when the user reports a GPU error, inference server crash, MIG problem, or any unexplained DGX Station failure.

publisher	hardware
nvidia	DGX Station GB300

DGX Station Diagnostics

Diagnose common DGX Station issues. Run through the checks below to identify the problem.

Step 1. Gather system state

Run these commands and analyze the output:

# GPU status
nvidia-smi

# GPU device list with indices
nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv,noheader

# Driver version
nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -1

# MIG state
nvidia-smi -i 1 -q 2>/dev/null | grep -i "MIG Mode" || echo "Could not query MIG on device 1"

# Fabric Manager
systemctl is-active nvidia-fabricmanager

# GPU processes
sudo fuser -v /dev/nvidia* 2>/dev/null || echo "No GPU processes found"

# Docker containers using GPUs
docker ps --format "table {{.Names}}\t{{.Image}}\t{{.Status}}" 2>/dev/null

Step 2. Match symptoms to known issues

Based on the gathered state and the user's reported problem, check for these known issues:

CUDA crashes with `--gpus all`

Cause: Mixed coherency — GB300 (ATS) and RTX PRO (non-ATS) cannot share a CUDA context. Fix: Use --gpus '"device=N"' targeting only the GB300.

Model running on wrong GPU (RTX PRO instead of GB300)

Check: The device index in the docker command vs actual GPU indices. Fix: Verify with nvidia-smi --query-gpu=index,name --format=csv,noheader and correct the --gpus flag.

vLLM crash / FlashInfer buffer overflow

Check: Container version — docker inspect vllm-server | grep Image Fix: Use nvcr.io/nvidia/vllm:26.01-py3. Version 25.10 has a known FlashInfer bug on DGX Station.

SGLang CUDA errors

Check: Container tag — must be cu130 for Blackwell SM103. Fix: Use lmsysorg/sglang:latest-cu130.

CUDA OOM despite 279 GB HBM

Check: --max-model-len / --context-length and memory utilization settings. Fix: Reduce context length or lower --gpu-memory-utilization / --mem-fraction-static.

`nvidia-smi -mig 1` returns "In use by another client"

Check: sudo fuser -v /dev/nvidia* — GPU processes must be stopped first. Fix: Stop all GPU workloads, then retry.

NVLink errors after disabling MIG

Check: systemctl is-active nvidia-fabricmanager Fix: sudo systemctl start nvidia-fabricmanager

X server crash after nvidia-xconfig -a

Fix: sudo cp /etc/X11/xorg.conf.nvidia-xconfig-original /etc/X11/xorg.conf

Vulkan VK_ERROR_INITIALIZATION_FAILED

Cause: CUDA initialized before Vulkan, binding to GB300. Fix: Run CUDA and Vulkan workloads in separate processes. For Vulkan apps: __GL_DeviceModalityPreference=2 ./your_app

HuggingFace 401 / token errors

Fix: Pass token inline: -e HF_TOKEN="hf_...". Don't rely on shell export for background Docker tasks.

Port already in use

Check: lsof -i :<PORT> Fix: Stop the conflicting process or use a different host port: -p 8001:8000.

Step 3. Report findings

Tell the user:

What the issue is
Why it happens (root cause)
The specific command to fix it
How to verify the fix worked

3.4 KiB Raw Blame History