---
name: dgx-diagnose
description: Diagnose common DGX Station GB300 issues — CUDA crashes, wrong-GPU targeting, vLLM/SGLang container bugs, MIG state problems, NVLink/Fabric Manager errors, X/Vulkan failures, HuggingFace auth, and port conflicts. Use when the user reports a GPU error, inference server crash, MIG problem, or any unexplained DGX Station failure.
metadata:
  publisher: nvidia
  hardware: DGX Station GB300
---

# DGX Station Diagnostics

Diagnose common DGX Station issues. Run through the checks below to identify the problem.

## Step 1. Gather system state

Run these commands and analyze the output:

```bash
# GPU status
nvidia-smi

# GPU device list with indices
nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv,noheader

# Driver version
nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -1

# MIG state (assumes the GB300 is device 1; adjust -i to match the index list above)
nvidia-smi -i 1 -q 2>/dev/null | grep -i "MIG Mode" || echo "Could not query MIG on device 1"

# Fabric Manager
systemctl is-active nvidia-fabricmanager

# GPU processes (fuser -v writes its process table to stderr, so stderr is kept)
sudo fuser -v /dev/nvidia* 2>&1 || echo "No GPU processes found"

# Running Docker containers (check which of these were started with GPU access)
docker ps --format "table {{.Names}}\t{{.Image}}\t{{.Status}}" 2>/dev/null
```

## Step 2. Match symptoms to known issues

Based on the gathered state and the user's reported problem, check for these known issues:

### CUDA crashes with `--gpus all`
**Cause:** Mixed coherency — GB300 (ATS) and RTX PRO (non-ATS) cannot share a CUDA context.
**Fix:** Use `--gpus '"device=N"'` targeting only the GB300.

### Model running on wrong GPU (RTX PRO instead of GB300)
**Check:** The device index in the docker command vs actual GPU indices.
**Fix:** Verify with `nvidia-smi --query-gpu=index,name --format=csv,noheader` and correct the `--gpus` flag.
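
The index lookup can be scripted so the `--gpus` flag is never typed by hand. The following is a minimal sketch, not a definitive implementation: the helper name `gb300_index` is hypothetical, and it assumes the GPU name reported by `nvidia-smi` contains the string `GB300`.

```bash
#!/usr/bin/env bash
# Sketch: print the index of the first GPU whose name contains "GB300".
# Input: `index, name` CSV lines on stdin, as produced by
#   nvidia-smi --query-gpu=index,name --format=csv,noheader
# Assumption: the reported GPU name contains "GB300" (hypothetical match rule).
gb300_index() {
  awk -F', *' 'toupper($2) ~ /GB300/ { print $1; exit }'
}

# Usage (not executed here; the trailing arguments are left to the caller):
#   idx="$(nvidia-smi --query-gpu=index,name --format=csv,noheader | gb300_index)"
#   [ -n "$idx" ] && docker run --gpus "\"device=${idx}\"" ...
```

If the function prints nothing, no GB300 was found in the list and the container should not be started.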

### vLLM crash / FlashInfer buffer overflow
**Check:** Container version — `docker inspect vllm-server | grep Image`
**Fix:** Use `nvcr.io/nvidia/vllm:26.01-py3`. Version 25.10 has a known FlashInfer bug on DGX Station.
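
The version check can be automated along these lines. This is a sketch under stated assumptions: `image_tag` and `is_affected_vllm` are hypothetical helper names, and the tag is taken to be everything after the last `:` in the image reference, which would misparse an untagged image pulled from a registry address that includes a port.

```bash
#!/usr/bin/env bash
# Sketch: extract the tag from an image reference and flag the 25.10 release.

# Print everything after the last ':' (the input is printed unchanged if it has no ':').
image_tag() {
  printf '%s\n' "${1##*:}"
}

# Return 0 if the image tag starts with "25.10", 1 otherwise.
is_affected_vllm() {
  case "$(image_tag "$1")" in
    25.10*) return 0 ;;
    *)      return 1 ;;
  esac
}

# Usage (not executed here; "vllm-server" is the container name used above):
#   img="$(docker inspect --format '{{.Config.Image}}' vllm-server)"
#   is_affected_vllm "$img" && echo "Switch to nvcr.io/nvidia/vllm:26.01-py3"
```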

### SGLang CUDA errors
**Check:** Container tag — must be `cu130` for Blackwell SM103.
**Fix:** Use `lmsysorg/sglang:latest-cu130`.

### CUDA OOM despite 279 GB HBM
**Check:** `--max-model-len` / `--context-length` and memory utilization settings.
**Fix:** Reduce context length or lower `--gpu-memory-utilization` / `--mem-fraction-static`.
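
Before lowering the memory settings, it can help to see how much memory is already taken on each GPU, since other processes reduce what the inference server can claim. A minimal sketch, assuming the hypothetical helper name `gpu_free_mib` and the `nounits` CSV output of `nvidia-smi` (values in MiB):

```bash
#!/usr/bin/env bash
# Sketch: print "<index> <free MiB>" for each GPU.
# Input: `index, memory.used, memory.total` CSV lines on stdin, without units, from
#   nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits
gpu_free_mib() {
  awk -F', *' 'NF >= 3 { printf "%s %d\n", $1, $3 - $2 }'
}

# Usage (not executed here):
#   nvidia-smi --query-gpu=index,memory.used,memory.total \
#     --format=csv,noheader,nounits | gpu_free_mib
```

If the free figure is well below the total, stopping the other GPU workloads may be a better first step than shrinking the context length.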

### `nvidia-smi -mig 1` returns "In use by another client"
**Check:** `sudo fuser -v /dev/nvidia*` — GPU processes must be stopped first.
**Fix:** Stop all GPU workloads, then retry.

### NVLink errors after disabling MIG
**Check:** `systemctl is-active nvidia-fabricmanager`
**Fix:** `sudo systemctl start nvidia-fabricmanager`

### X server crash after `nvidia-xconfig -a`
**Fix:** `sudo cp /etc/X11/xorg.conf.nvidia-xconfig-original /etc/X11/xorg.conf`

### Vulkan `VK_ERROR_INITIALIZATION_FAILED`
**Cause:** CUDA initialized before Vulkan, binding to GB300.
**Fix:** Run CUDA and Vulkan workloads in separate processes. For Vulkan apps: `__GL_DeviceModalityPreference=2 ./your_app`

### HuggingFace 401 / token errors
**Fix:** Pass the token inline: `-e HF_TOKEN="hf_..."`. Don't rely on a shell `export` for background Docker tasks, which may not inherit it.
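
A guard like the one below can catch a missing token before the container starts. It is a sketch only: `check_hf_token` is a hypothetical helper, and the `hf_` prefix test assumes the token is a Hugging Face user access token.

```bash
#!/usr/bin/env bash
# Sketch: fail early if HF_TOKEN is unset, empty, or lacks the "hf_" prefix.
# Assumption: valid tokens start with "hf_" followed by at least one character.
check_hf_token() {
  case "${HF_TOKEN:-}" in
    hf_?*) return 0 ;;
    *)     echo "HF_TOKEN is missing or malformed" >&2; return 1 ;;
  esac
}

# Usage (not executed here; the trailing arguments are left to the caller):
#   check_hf_token && docker run -e HF_TOKEN="$HF_TOKEN" ...
```

The value is expanded by the calling shell and passed explicitly, so the container does not depend on inheriting the variable.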

### Port already in use
**Check:** `lsof -i :<PORT>`
**Fix:** Stop the conflicting process or use a different host port: `-p 8001:8000`.
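
Choosing a different host port can be sketched as follows. The helper `first_free_port` is hypothetical; it only compares against a list of ports supplied on stdin, and the usage comment assumes `ss` from iproute2 is installed.

```bash
#!/usr/bin/env bash
# Sketch: print the first port >= START that does not appear in the list of
# used ports read from stdin (one port number per line).
first_free_port() {
  local start="$1" used port
  used="$(cat)"
  port="$start"
  # Step upward until the candidate is not an exact line in the used list.
  while printf '%s\n' "$used" | grep -qx "$port"; do
    port=$((port + 1))
  done
  printf '%s\n' "$port"
}

# Usage (not executed here; the trailing arguments are left to the caller):
#   host_port="$(ss -ltnH | awk '{print $4}' | sed 's/.*://' | first_free_port 8000)"
#   docker run -p "${host_port}:8000" ...
```

The container port stays at 8000; only the host side of the mapping changes.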

## Step 3. Report findings

Tell the user:
1. What the issue is
2. Why it happens (root cause)
3. The specific command to fix it
4. How to verify the fix worked