chore: Regenerate all playbooks

This commit is contained in:
GitLab CI 2026-06-11 01:07:29 +00:00
parent a0e917e6f5
commit bc6bf2251e
12 changed files with 409 additions and 425 deletions


@ -82,21 +82,18 @@ spec:
content: |
# Step 1. Log in to Brev
Go to the [Brev UI](https://brev.nvidia.com), log in, and confirm you're in the correct org (by clicking the org button on the top right-hand side of the page). Once logged in, go to the [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute) section under the "GPU" tab in the main navigation.
Go to the [Brev UI](https://brev.nvidia.com), log in, and confirm you're in the correct org (by clicking the org button on the top right hand side of the page). Once logged in, go to the [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute) section under the "GPU" tab in the main navigation.
Click the “Register Compute” button and follow the instructions in the pop-up window.
# Step 2. Complete Pop-up Instructions
# Step 2. Complete Popup Instructions
* Install the Brev CLI
* Configure your compute
* Add a name for compute
* To configure SSH, ensure the “Enable SSH access” toggle is on
* To configure ssh, ensure the “Enable SSH access” toggle is on
* Run the registration command
> [!IMPORTANT]
> Run the Brev CLI install command **without `sudo`**. Prefixing the installer with `sudo` writes the `brev` binary into root's home directory, which is not on your user shell's `PATH` — the next command will fail with `brev: command not found`. Copy the install command from the pop-up and run it as your normal user.
# Step 3. Follow Registration Flow
In the CLI, you'll be walked through registration. Go through the flow until registration is complete.
@ -113,14 +110,10 @@ spec:
Now that your hardware is connected, you can:
* **Access your machine from anywhere:** Open the [Brev UI](https://brev.nvidia.com) and launch a session from [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute).
* **Share access with others:** Invite teammates to your DGX Station from the Brev UI:
* Go to the [Brev UI](https://brev.nvidia.com) and open [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute).
* Find your DGX Station in the list and open the row's three-dot (⋯) menu.
* Select **Share Access**.
* Enter the email address of the person you want to share with.
* Choose their role / permission level.
* Confirm to send the invitation.
* **Share Access Anywhere:** Access your machine from anywhere and share access with others through the Brev UI by:
* Adding the user to your [Team](https://brev.nvidia.com/org/team)
* Navigating to your instance in the [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute) section
* In the **SSH Access** section of the instance, search for the user you wish to add and click **Modify Access** to enable access
# Step 6. Cleanup
@ -135,7 +128,7 @@ spec:
In the UI:
* Go to the [Brev UI](https://brev.nvidia.com)
* Navigate to the section listing “GPU Environments” and look under “Registered Compute”
* Click the “Remove” menu item on the device you wish to delete from Brev.
* Click the “Remove” menu item on the DGX Station you wish to delete from Brev.
* Confirm your selection.


@ -174,7 +174,7 @@ openshell --version
df -h /
```
Expected: Blackwell Ultra GPU, Docker >= 23.0.1, **Node.js v22.x** (the DGX Station ships with v18 — see below), OpenShell >= 0.0.33, and **at least 200 GB free** on `/` (86 GB model + Docker images + working space).
Expected: Blackwell Ultra GPU, Docker >= 23.0.1, **Node.js v22.x**, OpenShell >= 0.0.33, and **at least 200 GB free** on `/` (86 GB model + Docker images + working space).
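The minimum versions listed above can be compared mechanically. The sketch below is an illustration, not part of the playbook: `version_ge` is a hypothetical helper name, and it assumes plain dotted numeric versions (an optional leading `v` is stripped; suffixes such as `-rc1` are not handled).

```bash
# Hypothetical helper (not shipped with the playbook): succeeds when the
# dotted numeric version in $1 is greater than or equal to the one in $2.
version_ge() {
  local have="${1#v}" want="${2#v}" h w
  while [ -n "$have" ] || [ -n "$want" ]; do
    # Take the leading component of each version; a missing one counts as 0.
    h=${have%%.*}; w=${want%%.*}
    h=${h:-0}; w=${w:-0}
    if [ "$h" -gt "$w" ]; then return 0; fi
    if [ "$h" -lt "$w" ]; then return 1; fi
    # Drop the component just compared.
    case $have in *.*) have=${have#*.} ;; *) have= ;; esac
    case $want in *.*) want=${want#*.} ;; *) want= ;; esac
  done
  return 0
}

version_ge 0.0.40 0.0.33 && echo "OpenShell version OK"
version_ge 22.0.0 23.0.1 || echo "Docker too old"
```

Pass in the version numbers printed by the commands above; the two sample calls use made-up values to show both outcomes.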
> [!WARNING]
> If `openshell --version` says `command not found`, the binary is at `~/.local/bin/openshell` but isn't on PATH. Run the `export PATH=...` line above and re-source `~/.bashrc`. Without this, every `openshell` and `make` command in later steps fails.
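A quick way to confirm whether `~/.local/bin` is actually on `PATH` is sketched below. `dir_on_path` is a hypothetical helper, not something OpenShell provides; it only inspects the colon-separated list and does not modify your shell configuration.

```bash
# Hypothetical helper: succeeds when directory $1 is an entry of the
# colon-separated list in $2 (defaults to the current PATH).
dir_on_path() {
  case ":${2:-$PATH}:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

if ! dir_on_path "$HOME/.local/bin"; then
  echo "$HOME/.local/bin is not on PATH; run the export PATH line from the step above"
fi
```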
@ -182,10 +182,14 @@ Expected: Blackwell Ultra GPU, Docker >= 23.0.1, **Node.js v22.x** (the DGX Stat
> [!TIP]
> `make prereq` (run from `~/clinical-intelligence` after Step 2) bundles all of the checks below — Docker, Node version, OpenShell, disk space, GPU, port 11434, and NGC auth — into one command.
**If `node --version` reports v18.x or older**, install Node.js v22 before continuing:
**If `node --version` reports v18.x, older, or `command not found`**, install Node.js v22 before continuing:
```bash
curl -fsSL https://deb.nodesource.com/setup_22.x | sudo -E bash -
## Download the NodeSource setup script first, then run it with sudo.
## Running it inline with `| sudo bash` does not work — the sudo context
## needs to own the entire script execution.
curl -fsSL https://deb.nodesource.com/setup_22.x -o /tmp/nodesource_setup.sh
sudo bash /tmp/nodesource_setup.sh
sudo apt-get install -y nodejs
node --version # should now show v22.x
```
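The decision of whether to run the install commands above can be scripted. This is a minimal sketch under two assumptions: `node_needs_install` is a hypothetical helper name, and any major version of 22 or higher is treated as acceptable.

```bash
# Hypothetical helper: succeeds (meaning "install Node.js v22") when the
# version string in $1 is empty, unparseable, or has a major version below 22.
node_needs_install() {
  local v="${1#v}"
  local major="${v%%.*}"
  case $major in
    '' | *[!0-9]*) return 0 ;;   # node missing or output not a version
  esac
  [ "$major" -lt 22 ]
}

# An empty string is passed when `node` is not installed at all.
if node_needs_install "$(node --version 2>/dev/null)"; then
  echo "Install Node.js v22 before continuing"
fi
```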
@ -204,10 +208,20 @@ ss -tlnp 2>/dev/null | grep 11434 || echo 'port 11434 free'
Expected: `port 11434 free`. If the line still shows a listener, something else (an old `ollama serve`, another container, etc.) owns the port — stop it, or change `OLLAMA_PORT` in `.env` (Step 2) to a free port such as `11435`. `make setup` sources `.env` and configures the sandbox provider against the override.
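If you choose to override `OLLAMA_PORT`, the next free port can be found from `ss -tln` output rather than by guessing. The sketch below assumes the usual `ss` layout, where the local address column ends in `:PORT` followed by whitespace; `first_free_port` is a hypothetical helper name.

```bash
# Hypothetical helper: reads `ss -tln` output on stdin and prints the first
# port at or above $1 that has no listener.
first_free_port() {
  local port=$1 listeners
  listeners=$(cat)
  while printf '%s\n' "$listeners" | grep -E ":${port}[[:space:]]" >/dev/null; do
    port=$((port + 1))
  done
  printf '%s\n' "$port"
}

# Canned `ss -tln` line standing in for a machine where 11434 is taken.
printf 'LISTEN 0 4096 127.0.0.1:11434 0.0.0.0:*\n' | first_free_port 11434   # prints 11435
```

On a real machine, run `ss -tln | first_free_port 11434` and set `OLLAMA_PORT` in `.env` to the printed value.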
**Stale OpenShell gateway?** If you previously ran the NemoClaw playbook, an existing gateway will be silently reused under the new name. To start clean:
**Stale OpenShell gateway?** If you previously ran a playbook that started `openshell-gateway`, kill the process and remove the registration:
```bash
openshell gateway destroy 2>/dev/null || true
pkill -f openshell-gateway 2>/dev/null || true
openshell gateway remove openshell 2>/dev/null || true
```
**Previously ran the NemoClaw playbook?** NemoClaw installs `openclaw-gateway.service` as a systemd user service that binds port 18789. If it is still running, `make setup` fails with "Port 18789 is already in use". Stop and disable it before proceeding — `make setup` will also do this automatically, but stopping it here avoids a confusing error:
```bash
systemctl --user stop openclaw-gateway.service 2>/dev/null || true
systemctl --user disable openclaw-gateway.service 2>/dev/null || true
## Verify the port is free
ss -tlnp | grep 18789 || echo 'port 18789 free'
```
## Step 2. Copy the assets and configure
@ -290,8 +304,8 @@ make status
Expected:
```
Ollama: ✓ healthy
OpenFold3: ✓ healthy
Ollama (port 11434): ✓ healthy
OpenFold3 (port 8000): ✓ healthy
```
OpenFold3 takes ~3 minutes to load model weights on startup. If it shows "down (may still be loading)", wait and check again.
@ -301,26 +315,30 @@ OpenFold3 takes ~3 minutes to load model weights on startup. If it shows "down (
## Step 4. Start the OpenShell gateway
The OpenShell gateway runs a lightweight k3s Kubernetes cluster inside Docker to manage sandboxes. On DGX Station, the kernel uses cgroup v2 with the systemd driver, but k3s defaults to cgroupfs. The flag below tells k3s to match the host:
OpenShell >= 0.0.40 ships `openshell-gateway`, a standalone server binary installed alongside the CLI. Start it with the Docker driver (no Kubernetes required), then register it with the CLI:
```bash
OPENSHELL_K3S_ARGS='--kubelet-arg=cgroup-driver=systemd' openshell gateway start
## Start the gateway server in the background using the Docker compute driver.
## --disable-tls is safe for local-only use (loopback-bound).
nohup openshell-gateway \
--disable-tls \
--drivers docker \
--bind-address 127.0.0.1 \
--port 17670 \
> /tmp/openshell-gateway.log 2>&1 &
echo "Gateway PID: $!"
## Register the gateway with the CLI and set it as active.
openshell gateway add http://127.0.0.1:17670 --name openshell
```
Wait for the gateway's embedded k3s cluster to finish initializing (10-15 seconds after `gateway start` returns), then verify:
Verify the gateway is connected:
```bash
## Wait until the gateway accepts connections, fail after 60s
for i in $(seq 1 30); do
if openshell status 2>/dev/null | grep -q "Connected"; then
echo "Gateway: Connected"; break
fi
sleep 2
done
openshell status
```
Expected: `Status: Connected`. If the first `openshell status` immediately after `gateway start` reports `Connection reset by peer`, that is normal — k3s is still warming up. The loop above polls until it is ready.
Expected: `Status: Connected`. If not connected, check `/tmp/openshell-gateway.log` for errors. The gateway typically starts in under 1 second.
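If the gateway takes longer to come up on your machine, the status check can be polled instead of run once. The sketch below is an assumption layered on the commands already shown: `retry` and `gateway_connected` are hypothetical helper names, and the check relies on `openshell status` printing `Connected`, as in the expected output above.

```bash
# Hypothetical helper: run a command up to $1 times, sleeping $2 seconds
# between attempts; succeeds as soon as the command succeeds.
retry() {
  local attempts=$1 delay=$2 i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then return 0; fi
    if [ "$i" -lt "$attempts" ]; then sleep "$delay"; fi
    i=$((i + 1))
  done
  return 1
}

# Succeeds when the CLI reports a connected gateway.
gateway_connected() {
  openshell status 2>/dev/null | grep -q "Connected"
}
```

For example, `retry 30 2 gateway_connected || tail /tmp/openshell-gateway.log` waits up to about a minute and shows the log on failure.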
> [!NOTE]
> Step 4 configures OpenShell infrastructure (gateway). Step 5 deploys the healthcare agent into this infrastructure.
@ -457,7 +475,8 @@ Skill files are Markdown. Edit a threshold or drug classification — it takes e
```bash
openshell sandbox delete clinical-sandbox
make down
openshell gateway destroy
pkill -f openshell-gateway 2>/dev/null || true
openshell gateway remove openshell 2>/dev/null || true
```
To also remove downloaded models and volumes:


@ -130,7 +130,8 @@ test-docker: ## Run tests inside a container
teardown: ## Tear down sandbox, services, and gateway
openshell sandbox delete $${SANDBOX_NAME:-clinical-sandbox} 2>/dev/null || true
$(COMPOSE) down
openshell gateway destroy 2>/dev/null || true
pkill -f openshell-gateway 2>/dev/null || true
openshell gateway remove openshell 2>/dev/null || true
@echo "Teardown complete."
clean: ## Remove test results, PDB caches, and dangling images


@ -19,9 +19,7 @@
# --local Bind gateway to 0.0.0.0 for local browser access (no SSH tunnel needed)
# Default: loopback only (requires SSH tunnel from remote machine)
#
# Machine differences:
# GB300: Docker bridge 172.18.0.1, no sg docker prefix
# New Station: Docker bridge 172.17.0.1, needs sg docker prefix
# The Docker bridge IP is auto-detected via 'ip -4 addr show docker0' below.
set -euo pipefail
BIND_MODE="loopback"
@ -129,6 +127,20 @@ if openshell sandbox list 2>/dev/null | grep -q "$SANDBOX_NAME"; then
sleep 3
fi
# Stop any host-level service that owns $PORT (e.g. openclaw-gateway.service
# installed by the NemoClaw playbook as a systemd --user service). systemd
# will respawn the process if only the PID is killed, so stop the unit first.
if ss -tlnp 2>/dev/null | grep -qE "[: ]${PORT}[^0-9]"; then
echo "Detected listener on host :$PORT — stopping before forwarding..."
systemctl --user stop openclaw-gateway.service 2>/dev/null || true
systemctl --user disable openclaw-gateway.service 2>/dev/null || true
# Kill any remaining listener not managed by systemd (e.g. stale PID)
if ss -tlnp 2>/dev/null | grep -qE "[: ]${PORT}[^0-9]"; then
fuser -k "${PORT}/tcp" 2>/dev/null || true
sleep 1
fi
fi
# Stop any stale port forwards on $PORT from prior (possibly deleted) sandboxes.
# Stale forwards block re-creation with a cryptic error like
# "× Port 18789 is already forwarded to sandbox 'dgx-demo'."
@ -202,13 +214,37 @@ done
echo ""
# --- Step 4: Upload repo into sandbox ---
# Note: openshell sandbox upload (>= 0.0.44) copies the source *directory itself*
# (like `cp -r src/ dest/` creates dest/src/), not just its contents. We therefore
# upload to /sandbox/ so that the source directory `clinical-intelligence` lands at
# /sandbox/clinical-intelligence/ rather than /sandbox/clinical-intelligence/clinical-intelligence/.
echo "--- Step 4: Upload repo ---"
openshell sandbox upload "$SANDBOX_NAME" "$REPO_DIR" /sandbox/clinical-intelligence
openshell sandbox upload "$SANDBOX_NAME" "$REPO_DIR" /sandbox/
# Fix nested directories caused by upload (analysis-methods/analysis-methods/)
# Resolve the active gateway name for the ssh-proxy ProxyCommand.
# Precedence: OPENSHELL_GATEWAY env var (set by the CLI for all subcommands) →
# active gateway from `openshell status` → fallback to 'openshell'.
# This prevents a failure when the user previously ran the NemoClaw playbook
# (which registers its gateway as 'nemoclaw' instead of 'openshell').
_gw_name() {
if [ -n "${OPENSHELL_GATEWAY:-}" ]; then
printf '%s' "$OPENSHELL_GATEWAY"
return
fi
local name
name=$(openshell status 2>/dev/null \
| grep -oE 'Gateway:[[:space:]]+[A-Za-z0-9_-]+' \
| awk '{print $NF}' | head -1)
printf '%s' "${name:-openshell}"
}
GW_NAME="$(_gw_name)"
_sandbox() {
ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o LogLevel=ERROR \
-o "ProxyCommand=openshell ssh-proxy --gateway-name openshell --name $SANDBOX_NAME" \
-o ConnectTimeout=10 \
-o "ProxyCommand=openshell ssh-proxy --gateway-name $GW_NAME --name $SANDBOX_NAME" \
"sandbox@openshell-$SANDBOX_NAME" "$@"
}


@ -209,7 +209,7 @@ spec:
df -h /
```
Expected: Blackwell Ultra GPU, Docker >= 23.0.1, **Node.js v22.x** (the DGX Station ships with v18 — see below), OpenShell >= 0.0.33, and **at least 200 GB free** on `/` (86 GB model + Docker images + working space).
Expected: Blackwell Ultra GPU, Docker >= 23.0.1, **Node.js v22.x**, OpenShell >= 0.0.33, and **at least 200 GB free** on `/` (86 GB model + Docker images + working space).
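The 200 GB free-space requirement can be checked numerically rather than read off `df -h`. A minimal sketch, assuming POSIX `df -Pk` output (available space in 1024-byte blocks in the fourth column) and a filesystem name without spaces; `enough_disk` is a hypothetical helper name.

```bash
# Hypothetical helper: reads `df -Pk <path>` output on stdin and succeeds
# when the available space is at least $1 GB (1 GB = 1024 * 1024 KB here).
enough_disk() {
  awk -v need="$1" '
    NR == 2 { avail_gb = $4 / 1024 / 1024 }
    END { exit !(avail_gb >= need) }
  '
}

df -Pk / | enough_disk 200 || echo "Less than 200 GB free on /"
```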
> [!WARNING]
> If `openshell --version` says `command not found`, the binary is at `~/.local/bin/openshell` but isn't on PATH. Run the `export PATH=...` line above and re-source `~/.bashrc`. Without this, every `openshell` and `make` command in later steps fails.
@ -217,10 +217,14 @@ spec:
> [!TIP]
> `make prereq` (run from `~/clinical-intelligence` after Step 2) bundles all of the checks below — Docker, Node version, OpenShell, disk space, GPU, port 11434, and NGC auth — into one command.
**If `node --version` reports v18.x or older**, install Node.js v22 before continuing:
**If `node --version` reports v18.x, older, or `command not found`**, install Node.js v22 before continuing:
```bash
curl -fsSL https://deb.nodesource.com/setup_22.x | sudo -E bash -
# Download the NodeSource setup script first, then run it with sudo.
# Running it inline with `| sudo bash` does not work — the sudo context
# needs to own the entire script execution.
curl -fsSL https://deb.nodesource.com/setup_22.x -o /tmp/nodesource_setup.sh
sudo bash /tmp/nodesource_setup.sh
sudo apt-get install -y nodejs
node --version # should now show v22.x
```
@ -239,10 +243,20 @@ spec:
Expected: `port 11434 free`. If the line still shows a listener, something else (an old `ollama serve`, another container, etc.) owns the port — stop it, or change `OLLAMA_PORT` in `.env` (Step 2) to a free port such as `11435`. `make setup` sources `.env` and configures the sandbox provider against the override.
**Stale OpenShell gateway?** If you previously ran the NemoClaw playbook, an existing gateway will be silently reused under the new name. To start clean:
**Stale OpenShell gateway?** If you previously ran a playbook that started `openshell-gateway`, kill the process and remove the registration:
```bash
openshell gateway destroy 2>/dev/null || true
pkill -f openshell-gateway 2>/dev/null || true
openshell gateway remove openshell 2>/dev/null || true
```
**Previously ran the NemoClaw playbook?** NemoClaw installs `openclaw-gateway.service` as a systemd user service that binds port 18789. If it is still running, `make setup` fails with "Port 18789 is already in use". Stop and disable it before proceeding — `make setup` will also do this automatically, but stopping it here avoids a confusing error:
```bash
systemctl --user stop openclaw-gateway.service 2>/dev/null || true
systemctl --user disable openclaw-gateway.service 2>/dev/null || true
# Verify the port is free
ss -tlnp | grep 18789 || echo 'port 18789 free'
```
# Step 2. Copy the assets and configure
@ -325,8 +339,8 @@ spec:
Expected:
```
Ollama: ✓ healthy
OpenFold3: ✓ healthy
Ollama (port 11434): ✓ healthy
OpenFold3 (port 8000): ✓ healthy
```
OpenFold3 takes ~3 minutes to load model weights on startup. If it shows "down (may still be loading)", wait and check again.
@ -336,26 +350,30 @@ spec:
# Step 4. Start the OpenShell gateway
The OpenShell gateway runs a lightweight k3s Kubernetes cluster inside Docker to manage sandboxes. On DGX Station, the kernel uses cgroup v2 with the systemd driver, but k3s defaults to cgroupfs. The flag below tells k3s to match the host:
OpenShell >= 0.0.40 ships `openshell-gateway`, a standalone server binary installed alongside the CLI. Start it with the Docker driver (no Kubernetes required), then register it with the CLI:
```bash
OPENSHELL_K3S_ARGS='--kubelet-arg=cgroup-driver=systemd' openshell gateway start
# Start the gateway server in the background using the Docker compute driver.
# --disable-tls is safe for local-only use (loopback-bound).
nohup openshell-gateway \
--disable-tls \
--drivers docker \
--bind-address 127.0.0.1 \
--port 17670 \
> /tmp/openshell-gateway.log 2>&1 &
echo "Gateway PID: $!"
# Register the gateway with the CLI and set it as active.
openshell gateway add http://127.0.0.1:17670 --name openshell
```
Wait for the gateway's embedded k3s cluster to finish initializing (10-15 seconds after `gateway start` returns), then verify:
Verify the gateway is connected:
```bash
# Wait until the gateway accepts connections, fail after 60s
for i in $(seq 1 30); do
if openshell status 2>/dev/null | grep -q "Connected"; then
echo "Gateway: Connected"; break
fi
sleep 2
done
openshell status
```
Expected: `Status: Connected`. If the first `openshell status` immediately after `gateway start` reports `Connection reset by peer`, that is normal — k3s is still warming up. The loop above polls until it is ready.
Expected: `Status: Connected`. If not connected, check `/tmp/openshell-gateway.log` for errors. The gateway typically starts in under 1 second.
> [!NOTE]
> Step 4 configures OpenShell infrastructure (gateway). Step 5 deploys the healthcare agent into this infrastructure.
@ -492,7 +510,8 @@ spec:
```bash
openshell sandbox delete clinical-sandbox
make down
openshell gateway destroy
pkill -f openshell-gateway 2>/dev/null || true
openshell gateway remove openshell 2>/dev/null || true
```
To also remove downloaded models and volumes:


@ -2,7 +2,7 @@ kind: Playbook
metadata:
name: station-local-coding-agent
displayName: Local Coding Agent
shortDescription: Run local CLI coding agents with Ollama on DGX Station (NVIDIA GB300) using glm-4.7-flash (fast) or unsloth/GLM-4.7-GGUF:Q8_0 (best quality)
shortDescription: Run local CLI coding agents with Ollama on DGX Station (GB300 Ultra) using GLM-4.7 and GLM-4.7-Flash
publisher: nvidia
description: |
@ -17,6 +17,8 @@ metadata:
- LLM
- Ollama
- Claude Code
- OpenCode
- Codex
attributes:
- key: DURATION
@ -39,18 +41,24 @@ spec:
content: |
# Basic idea
Use Ollama on **DGX Station (NVIDIA GB300)** to run local coding models and connect a CLI coding agent. This
playbook uses **Claude Code** to talk to Ollama for local inference, so you can work without external cloud APIs.
Use Ollama on **DGX Station with GB300 Ultra** to run local coding models and connect a CLI coding agent. This
playbook supports three options: **Claude Code**, **OpenCode**, and **Codex CLI**. Each
agent talks to Ollama for local inference, so you can work without external cloud APIs.
The DGX Station GPU (reported as **NVIDIA GB300** in `nvidia-smi`) provides ample memory to run **glm-4.7-flash** (fast loading and testing) and larger models such as **unsloth/GLM-4.7-GGUF:Q8_0** (best quality), both supported on Ollama.
The GB300 Ultra's massive GPU memory lets you run **GLM-4.7** and **GLM-4.7-Flash** in high-quality variants (e.g. bf16, q8_0) for the best coding-assistant quality directly on the Station.
# CLI agent
# Choose your CLI agent
This playbook uses **Claude Code** as the CLI agent, connected to a local Ollama model for inference.
Pick the tab that matches the CLI agent you want to use:
- **Claude Code**: Fastest path to a working CLI agent with a local Ollama model.
- **OpenCode**: Open-source CLI with provider configuration; this guide targets Ollama.
- **Codex CLI**: OpenAI Codex CLI configured to run against Ollama locally.
# What you'll accomplish
You will run a local coding model on your **DGX Station (NVIDIA GB300)** with Ollama, connect Claude Code to it, and complete a small coding task end-to-end. Use **glm-4.7-flash** (including high-quality variants) or **unsloth/GLM-4.7-GGUF:Q8_0** for best quality.
You will run a local coding model on your **DGX Station (GB300 Ultra)** with Ollama, connect it to your
chosen CLI agent, and complete a small coding task end-to-end. You can use **GLM-4.7** or **GLM-4.7-Flash** (including high-quality variants) to take full advantage of the Station's memory.
# What to know before starting
@ -60,14 +68,13 @@ spec:
# Prerequisites
- **DGX Station** with **NVIDIA GB300** (Grace Blackwell) and NVIDIA driver; `nvidia-smi` typically shows "NVIDIA GB300"
- **DGX Station** with **GB300 Ultra** (Grace Blackwell) and NVIDIA driver
- Internet access to download model weights
- **Ollama 0.15.0 or newer** (required for GLM-4.7-Flash; do not pin to 0.14.3)
- **GPU memory** on GB300 supports both recommended models:
- **glm-4.7-flash**: ~19 GB (`latest`) to ~60 GB (bf16) — **recommended for fast loading and testing**
- **unsloth/GLM-4.7-GGUF:Q8_0** (Hugging Face on Ollama): larger model — **recommended for best quality**
- Other variants (e.g. `glm-4.7-flash:bf16`, `glm-4.7-flash:q8_0`) fit on GB300
- **Disk space** for model downloads: plan for ~19 GB for `glm-4.7-flash:latest`, plus additional space for the Q8_0 or bf16 variants if you use them
- Ollama 0.14.3 or newer
- **GPU memory** on GB300 Ultra supports GLM-4.7 and high-quality variants:
- **GLM-4.7-Flash** (30B): ~19GB (latest) to ~60GB (bf16) — recommended default for coding
- **GLM-4.7** (full): use `ollama pull glm-4.7` for higher quality when available
- High-quality variants (e.g. `glm-4.7-flash:bf16`, `glm-4.7-flash:q8_0`) fit comfortably on GB300 Ultra
# Time & risk
@ -76,8 +83,8 @@ spec:
* Large model downloads can fail if network connectivity is unstable
* Older Ollama versions will not load newer models
* **Rollback**: Stop Ollama and delete the downloaded model from `~/.ollama/models`
* **Last Updated:** 03/06/2026
* Model set to glm-4.7-flash; Ollama 0.15.0+; cleanup order and docs refresh
* **Last Updated:** February 2025
* Tailored for DGX Station with GB300 Ultra; added large-model recommendations
@ -94,91 +101,51 @@ spec:
nvidia-smi
```
**Expected output** (example): A table showing driver version and GPU(s). On DGX Station, the GPU name may appear as **NVIDIA GB300** (without "Ultra"):
```text
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 5xx.xx Driver Version: 5xx.xx CUDA Version: 12.x |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| 0 NVIDIA GB300 On | 00000000:06:00.0 Off | 0 |
...
```
Expected output should show a detected GPU (e.g. GB300 Ultra).
# Step 2. Install or update Ollama
**Description**: Install Ollama or ensure it is recent enough for modern coding models.
```bash
curl -fsSL https://ollama.com/install.sh | sh
curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.14.3 sh
ollama --version
```
To install a specific version (e.g. 0.15.0 or newer, required for GLM-4.7-Flash):
```bash
curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.15.0 sh
```
If Ollama is already present and the version is 0.15.0 or newer, simply run:
If Ollama is already present and the version is 0.14.3 or newer, simply run:
```bash
ollama --version
```
**Expected output** (example):
```text
ollama version is 0.15.0
```
Expected output should show `ollama --version` as 0.14.3 or newer.
# Step 3. Pull a coding model
**Description**: Download the model weights to your DGX Station. This playbook supports two model options on Ollama; choose one (or both) depending on whether you want **fast loading and testing** or **best quality**.
**Description**: Download the model weights to your DGX Station. This playbook uses **GLM-4.7** where available.
**For fast loading and testing** — **glm-4.7-flash** (~19 GB for `latest`; loads quickly; ensure Ollama 0.15.0+):
**Recommended: GLM-4.7**:
```bash
ollama pull glm-4.7-flash
ollama pull glm-4.7
```
**For best quality** — **unsloth/GLM-4.7-GGUF:Q8_0** from Hugging Face (larger, higher quality; supported on Ollama):
```bash
ollama pull hf.co/unsloth/GLM-4.7-GGUF:Q8_0
```
**Other glm-4.7-flash variants** on GB300 (more GPU memory; bf16 is ~60 GB):
**High-quality variants** on GB300 Ultra (use more GPU memory for better quality):
```bash
ollama pull glm-4.7-flash:q8_0
ollama pull glm-4.7-flash:bf16
```
**Expected output** (example): Progress lines followed by "success" and the model in `ollama list`:
```bash
ollama list
```
```text
NAME ID SIZE MODIFIED
glm-4.7-flash:latest abc123... 19 GB 1 minute ago
unsloth/GLM-4.7-GGUF:Q8_0 def456... ... ...
```
Expected output should show your model in `ollama list`.
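The presence check can also be scripted. The sketch below assumes the `ollama list` layout of a header row followed by one model per line with `NAME` (including its tag) in the first column, and that a name given without a tag refers to `:latest`; `model_pulled` is a hypothetical helper name.

```bash
# Hypothetical helper: reads `ollama list` output on stdin and succeeds when
# the model named in $1 appears in the NAME column.
model_pulled() {
  local want=$1
  # A name without a tag is looked up as NAME:latest.
  case $want in *:*) ;; *) want="$want:latest" ;; esac
  awk -v want="$want" 'NR > 1 && $1 == want { found = 1 } END { exit !found }'
}
```

For example, `ollama list | model_pulled glm-4.7 || ollama pull glm-4.7` pulls the model only when it is missing.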
# Step 4. Test local inference
**Description**: Run a quick prompt to confirm the model loads. Use the same model name you pulled (e.g. `glm-4.7-flash` for fast testing, or `hf.co/unsloth/GLM-4.7-GGUF:Q8_0` for best quality).
**Description**: Run a quick prompt to confirm the model loads. Use the same model name you pulled (e.g. `glm-4.7`).
```bash
ollama run glm-4.7-flash
```
Or, if you pulled the larger model:
```bash
ollama run hf.co/unsloth/GLM-4.7-GGUF:Q8_0
ollama run glm-4.7
```
Try a prompt like:
@ -187,9 +154,7 @@ spec:
Write a short README checklist for a Python project.
```
**Expected output**: GLM-4.7-Flash may show **Thinking...** and reasoning text before the final answer, then the model's response. This is normal; wait for the reply to complete.
**Exit the Ollama REPL** when done: type `/bye` or press **Ctrl+D**.
Expected output should show the model responding in the terminal.
# Step 5. Install Claude Code
@ -199,24 +164,16 @@ spec:
curl -fsSL https://claude.ai/install.sh | sh
```
**Verify the installation**:
```bash
claude --version
```
**Expected output** (example): A version string such as `claude 0.x.x` or similar. If you see `claude: command not found`, ensure the install script added the CLI to your PATH (e.g. restart the terminal or source your shell profile); see [Troubleshooting](troubleshooting.md).
# Step 6. Increase context length (optional)
**Description**: Ollama defaults to a 4096 token context length. For coding agents and
larger codebases, set it to 64K tokens. This increases memory usage.
For more details on configuring context length and other parameters, see the Ollama documentation (context window and runtime options).
For more details on configuring context length, see the [Ollama documentation](https://ollama.com/docs/faq#how-can-i-increase-the-context-length).
Set the context length per session in the Ollama REPL (use the same model name you pulled, e.g. `glm-4.7-flash` or `hf.co/unsloth/GLM-4.7-GGUF:Q8_0`):
Set the context length per session in the Ollama REPL:
```bash
ollama run glm-4.7-flash
ollama run glm-4.7
```
Then, in the Ollama prompt:
@ -226,8 +183,6 @@ spec:
```
**Exit when done**: type `/bye` or press **Ctrl+D**.
Optional method (set globally when serving Ollama):
```bash
@ -239,35 +194,16 @@ spec:
# Step 7. Connect Claude Code to Ollama
**Description**: Point Claude Code to the local Ollama server and launch it. Use the model you pulled: `glm-4.7-flash` (fast) or `hf.co/unsloth/GLM-4.7-GGUF:Q8_0` (best quality).
**Description**: Point Claude Code to the local Ollama server and launch it. Use the model you pulled (e.g. GLM-4.7 or GLM-4.7-Flash).
```bash
export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_BASE_URL=http://localhost:11434
claude --model glm-4.7-flash
claude --model glm-4.7
```
If you are using the larger model:
```bash
claude --model hf.co/unsloth/GLM-4.7-GGUF:Q8_0
```
- **`ANTHROPIC_AUTH_TOKEN=ollama`**: Claude Code treats the literal value `ollama` as a special token that means "use the local Ollama backend" instead of Anthropic's cloud API. No real API key is needed when using Ollama.
- **`ANTHROPIC_BASE_URL`**: Tells Claude Code to send requests to your local Ollama server at port 11434.
**Persist these variables** (optional) so you don't have to re-export every terminal session. Add to `~/.bashrc` or your shell profile (e.g. `~/.zshrc`):
```bash
echo 'export ANTHROPIC_AUTH_TOKEN=ollama' >> ~/.bashrc
echo 'export ANTHROPIC_BASE_URL=http://localhost:11434' >> ~/.bashrc
source ~/.bashrc
```
**Expected output**: Claude Code starts and uses the local model.
**Exit Claude Code** when done: type `/exit` or press **Ctrl+C**.
Expected output should show Claude Code starting and using the local model.
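To avoid re-exporting the two variables in every terminal, they can be added to a shell profile. A minimal sketch, assuming a Bash profile at `~/.bashrc`; `append_once` is a hypothetical helper that skips lines already present, so running it repeatedly does not duplicate entries.

```bash
# Hypothetical helper: append the exact line $2 to file $1 unless an
# identical line is already there.
append_once() {
  touch "$1"
  grep -qxF -e "$2" "$1" || printf '%s\n' "$2" >> "$1"
}
```

Call it once per variable, for example `append_once ~/.bashrc 'export ANTHROPIC_AUTH_TOKEN=ollama'` and `append_once ~/.bashrc 'export ANTHROPIC_BASE_URL=http://localhost:11434'`, then open a new shell or run `source ~/.bashrc`.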
# Step 8. Complete a small coding task
@ -287,13 +223,13 @@ spec:
python -m pip install -U pytest
```
In Claude Code, enter:
In Claude Code:
```text
Please implement add() in math_utils.py and make sure the test passes.
```
**Exit Claude Code** when finished: type `/exit` or press **Ctrl+C**, then run the test:
Run the test:
```bash
python -m pytest -q
@ -303,37 +239,26 @@ spec:
# Step 9. Cleanup and rollback
**Description**: Remove the model and stop the Ollama service if you no longer need them. **Remove the model first** (while the Ollama server is running), then stop the service.
**Description**: Remove the model and stop services if you no longer need them.
> [!WARNING]
> The following removes the downloaded model files from disk.
**1. Remove the model** (Ollama must be running). Use the same name you pulled:
```bash
ollama rm glm-4.7-flash
```
Or, for the Hugging Face model:
```bash
ollama rm hf.co/unsloth/GLM-4.7-GGUF:Q8_0
```
Use the exact tag you pulled (e.g. `glm-4.7-flash:bf16` if you used that variant).
**2. Stop the Ollama service**:
To stop the service:
```bash
sudo systemctl stop ollama
```
> [!WARNING]
> This will delete the downloaded model files.
```bash
ollama rm glm-4.7
```
# Step 10. Next steps
- **Fast loading and testing:** use **glm-4.7-flash** for quick iteration and smaller downloads.
- **Best quality:** use **unsloth/GLM-4.7-GGUF:Q8_0** (Hugging Face on Ollama) or **glm-4.7-flash** high-quality variants (`glm-4.7-flash:bf16`, `glm-4.7-flash:q8_0`) on DGX Station (NVIDIA GB300).
- Use larger context (e.g. 64K-198K) for big codebases.
- Use Claude Code on multi-file refactors or test-generation tasks.
- Use **GLM-4.7** or high-quality variants (`glm-4.7-flash:bf16`, `glm-4.7-flash:q8_0`) on GB300 Ultra for best quality
- Use larger context (e.g. 64K-198K) for big codebases
- Use Claude Code on multi-file refactors or test-generation tasks
@ -345,15 +270,18 @@ spec:
| Symptom | Cause | Fix |
|---------|-------|-----|
| `ollama: command not found` | Ollama not installed or PATH not updated | Rerun `curl -fsSL https://ollama.com/install.sh \| sh` and open a new shell |
| Model load fails with version error | Ollama is older than 0.15.0 | Update Ollama to 0.15.0 or newer (required for GLM-4.7-Flash). Do not pin to 0.14.3. |
| `model not found` in Claude Code | Model was not pulled | Run `ollama pull glm-4.7-flash` or `ollama pull hf.co/unsloth/GLM-4.7-GGUF:Q8_0` and retry. Use the same model name in `claude --model ...`. |
| `connection refused` to localhost:11434 | Ollama service not running | Start with `ollama serve` or `sudo systemctl start ollama` |
| Slow responses or OOM | Insufficient GPU memory or fragmentation | On DGX Station (NVIDIA GB300), ensure no other heavy GPU workloads. If OOM persists, use a smaller variant (e.g. `glm-4.7-flash:q8_0` or `glm-4.7-flash:q4_K_M`) or `OLLAMA_MAX_LOADED_MODELS=1`. |
| `claude: command not found` after install | CLI not on PATH or install script did not complete | Restart the terminal or run `source ~/.bashrc` (or your shell profile). Check the install script output for the install path and add it to PATH. |
| Claude Code install fails (Node.js / network) | Node.js missing or install script cannot download | Ensure Node.js is installed (`node --version`). If the install script fails with a network error, retry from a stable connection or download the Claude Code CLI from the official site. See [Claude Code documentation](https://claude.ai/docs) for alternatives. |
| Model load fails with version error | Ollama is older than 0.14.3 | Update Ollama to 0.14.3 or newer |
| `model not found` in Claude Code | Model was not pulled | Run `ollama pull glm-4.7-flash` or `ollama pull glm-4.7` and retry |
| `opencode: command not found` | OpenCode not installed or PATH not updated | Install OpenCode and open a new shell |
| OpenCode cannot reach Ollama | `baseURL` misconfigured or Ollama not running | Set `baseURL` to `http://localhost:11434/v1` and start Ollama |
| `codex: command not found` | Codex CLI not installed or PATH not updated | Install Codex CLI and open a new shell |
| Codex CLI uses the wrong model/provider | `~/.codex/config.toml` not pointing to Ollama | Set `model_provider = "ollama"` and `base_url = "http://localhost:11434/v1"` |
| `connection refused` to localhost:11434 | Ollama service not running | Start with `ollama serve` or `systemctl start ollama` |
| Slow responses or OOM | Insufficient GPU memory or fragmentation | On DGX Station GB300 Ultra, ensure no other heavy GPU workloads. If OOM persists, use a smaller variant (e.g. `glm-4.7-flash:q8_0` or `glm-4.7-flash:q4_K_M`) or `OLLAMA_MAX_LOADED_MODELS=1`. |
> [!NOTE]
> DGX Station with **NVIDIA GB300** provides ample GPU memory for **glm-4.7-flash** (fast testing) and **unsloth/GLM-4.7-GGUF:Q8_0** (best quality), plus variants (e.g. `glm-4.7-flash:bf16`). Use `OLLAMA_MAX_LOADED_MODELS=1` if you hit memory limits with multiple models.
> DGX Station with GB300 Ultra provides ample GPU memory for **GLM-4.7** and **GLM-4.7-Flash** in high-quality
> variants (e.g. `glm-4.7-flash:bf16`). Use `OLLAMA_MAX_LOADED_MODELS=1` if you hit memory limits with multiple models.
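
The version requirement in the table above can be checked before pulling a model. The following is a minimal sketch, assuming GNU `sort -V` is available and that `ollama --version` prints a dotted `x.y.z` version somewhere in its output. `MIN_VERSION` and the helper `version_ge` are illustrative names; set the minimum to whichever value the troubleshooting row for your playbook revision states.

```bash
#!/usr/bin/env bash
# Sketch: compare the installed Ollama version against a required minimum.

# version_ge A B: succeeds (exit 0) when version A >= version B.
# Relies on the -V (version sort) option of GNU sort.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Assumption: set this to the minimum stated in the troubleshooting table.
MIN_VERSION="${MIN_VERSION:-0.15.0}"

if command -v ollama >/dev/null 2>&1; then
  # Extract the first x.y.z token from the version banner.
  installed="$(ollama --version 2>/dev/null | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1)"
  if [ -n "$installed" ] && version_ge "$installed" "$MIN_VERSION"; then
    echo "Ollama $installed satisfies >= $MIN_VERSION"
  else
    echo "Ollama ${installed:-unknown} is older than $MIN_VERSION; update it"
  fi
else
  echo "ollama not found on PATH"
fi
```

Comparing with a version sort rather than a plain string comparison avoids treating `0.9.0` as newer than `0.15.0`.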
@ -363,15 +291,31 @@ spec:
url: https://ollama.com/docs
- name: GLM-4.7-Flash
- name: GLM-4.7-Flash (Ollama)
url: https://ollama.com/library/glm-4.7-flash
- name: Unsloth GLM-4.7-GGUF
url: https://huggingface.co/unsloth/GLM-4.7-GGUF
- name: GLM-4.7 (Ollama)
url: https://ollama.com/library/glm-4.7
- name: Claude Code + Ollama Guide
url: https://ollama.com/blog/claude
- name: OpenCode Ollama Provider
url: https://opencode.ai/docs/providers/#ollama
- name: Codex + Ollama Guide
url: https://ollama.com/blog/codex
- name: DGX Station Documentation
url: https://docs.nvidia.com/dgx/dgx-station
- name: DGX Station Forum
url: https://forums.developer.nvidia.com/c/accelerated-computing/dgx-station


@ -54,7 +54,7 @@ spec:
- **Enable MIG** on all B300 GPUs or on a per-GPU basis.
- **Create a MIG layout** using B300 profile IDs (with a known-good example for multiple GPUs).
- **Verify** the layout with `nvidia-smi -L` and `sudo nvidia-smi mig -lgi` / `-lci`.
- **Verify** the layout with `nvidia-smi -L` and `nvidia-smi mig -lgi` / `-lci`.
- **Run workloads** by setting `CUDA_VISIBLE_DEVICES` to a MIG UUID or by using the container/Kubernetes flows from the MIG User Guide.
- **Disable MIG** when you need full-GPU mode and NVLink again.
@ -73,7 +73,7 @@ spec:
**Software:**
- NVIDIA driver and `nvidia-smi` installed and working: `nvidia-smi`. Use a driver version that supports MIG on B300 (see [Troubleshooting](troubleshooting.md) for version guidance; if `nvidia-smi -mig 1` reports "MIG mode not supported" or similar, the driver may be too old).
- NVIDIA driver and `nvidia-smi` installed and working: `nvidia-smi`
- Root or sudo access to run `nvidia-smi -mig 1`, `-mig 0`, and `nvidia-smi mig -cgi ... -C`
- For containers/K8s: nvidia-container-toolkit and MIG support as described in the [MIG User Guide](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/)
@ -81,15 +81,14 @@ spec:
This playbook does not use repository assets; all steps use `nvidia-smi` and MIG commands on the DGX Station. For container and Kubernetes setup, use the official [MIG User Guide](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/) (Getting Started with MIG and Kubernetes sections).
# Time & risk
- **Estimated time:** About 15 minutes to enable MIG, create a layout, and verify. Layout design (which profiles per GPU) may take longer if you customize.
- **Risk level:** Low to Medium
- Enabling or disabling MIG requires sudo and affects all workloads on that GPU.
- Disabling MIG removes all MIG instances; ensure Fabric Manager is running on DGX/HGX B200/B300 so NVLink/NVSwitch re-initialize correctly.
- **Rollback:** Destroy all MIG instances with `sudo nvidia-smi mig -dci -i N` and `sudo nvidia-smi mig -dgi -i N` for each GPU index N, then run `sudo nvidia-smi -mig 0` to disable MIG and return to a single full-GPU instance per GB300. Ensure **Fabric Manager** is running after disabling MIG: `sudo systemctl status nvidia-fabricmanager` (start if needed: `sudo systemctl start nvidia-fabricmanager`).
- **Last Updated:** 03/02/2026
- **Rollback:** Run `sudo nvidia-smi -mig 0` to disable MIG and return to a single full-GPU instance per B300.
- **Last Updated:** February 2025
- First publication.
@ -101,26 +100,18 @@ spec:
content: |
# Step 1. Prerequisites and verify B300 GPUs
Ensure your DGX Station has B300 GPUs (GB300 Ultra), a supported NVIDIA driver (see [Troubleshooting](troubleshooting.md) for driver requirements), and that `nvidia-smi` is available. You need root or sudo to enable MIG and create instances.
**Before enabling MIG:** All GPU processes must be stopped. Desktop environments (e.g. GNOME, Xwayland), NVIDIA services (e.g. nvsm_core, nvidia-pe, nv-hostengine), or workloads like vLLM can hold the GPU and cause "In use by another client" when you run MIG commands. Check what is using the GPUs:
```bash
sudo fuser -v /dev/nvidia*
```
Stop or suspend any processes that are using the GPUs before proceeding to Step 2.
Ensure your DGX Station is running with B300 GPUs (GB300 Ultra) and that the NVIDIA driver and `nvidia-smi` are available. You need root or sudo to enable MIG and create instances.
```bash
nvidia-smi
nvidia-smi -L
```
Expected output should list one or more **NVIDIA GB300** devices. If you see GB300 GPUs, you can proceed to enable MIG.
Expected output should list one or more **NVIDIA B300** devices (e.g. `NVIDIA B300 SXM6 AC`). If you see B300 GPUs, you can proceed to enable MIG.
# Step 2. Enable MIG mode on the B300 GPUs
Ensure no GPU processes are running (see Step 1). Enable MIG for all GPUs or for a specific GPU. This must be done with elevated privileges.
Enable MIG for all GPUs in the system or for a specific GPU. This must be done with elevated privileges.
**Enable MIG on all GPUs:**
@ -134,10 +125,6 @@ spec:
sudo nvidia-smi -i 0 -mig 1
```
**Expected output:** Success typically shows no error message; the command returns to the prompt. If you see "In use by another client", stop all GPU processes (e.g. desktop, services, containers) and run `sudo fuser -v /dev/nvidia*` to confirm nothing is using the GPUs, then retry.
If MIG mode shows **Pending** after enablement (e.g. in `nvidia-smi -q | grep -i mig`), wait a short time and run the command again, or reboot the system to allow the driver to apply the MIG state.
Enabling MIG partitions each B300 into multiple GPU Instances; you will create and assign profiles in the next steps.
# Step 3. Verify MIG mode and inspect B300 profiles
@ -158,15 +145,15 @@ spec:
nvidia-smi mig -lgip -i 0
```
On GB300 you should see profiles such as the following (exact memory sizes may vary by driver; the IDs are what you pass in commands):
On B300 you should see profiles such as:
- MIG 1g.35gb (ID 19)
- MIG 1g.35gb+me (ID 20)
- MIG 1g.70gb (ID 15)
- MIG 2g.70gb (ID 14)
- MIG 3g.139gb (ID 9)
- MIG 4g.139gb (ID 5)
- MIG 7g.278gb (ID 0)
- MIG 1g.34gb (ID 19)
- MIG 1g.34gb+me (ID 20)
- MIG 1g.67gb (ID 15)
- MIG 2g.67gb (ID 14)
- MIG 3g.135gb (ID 9)
- MIG 4g.135gb (ID 5)
- MIG 7g.269gb (ID 0)
Note the **IDs**; you will pass them to `-cgi` when creating the layout.
@ -178,29 +165,29 @@ spec:
sudo nvidia-smi mig -cgi <profile_id,profile_id,...> -C -i <gpu_index>
```
This example assumes a **6-GPU** DGX Station. If you have fewer GPUs (e.g. 1 or 2), run only the `-cgi` lines for the GPU indices that exist on your system (e.g. `-i 0` and `-i 1` only). Each GPU can have any combination of profiles that fits within its capacity:
Example layout for a 6-GPU DGX Station (adjust GPU indices and counts to match your system). Each GPU can have any combination of profiles that fits within its capacity:
```bash
# GPU 0: 7 × 1g.35gb
# GPU 0: 7 × 1g.34gb
sudo nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C -i 0
# GPU 1: 4 × 1g.70gb
# GPU 1: 4 × 1g.67gb
sudo nvidia-smi mig -cgi 15,15,15,15 -C -i 1
# GPU 2: 3 × 2g.70gb
# GPU 2: 3 × 2g.67gb
sudo nvidia-smi mig -cgi 14,14,14 -C -i 2
# GPU 3: 2 × 3g.139gb
# GPU 3: 2 × 3g.135gb
sudo nvidia-smi mig -cgi 9,9 -C -i 3
# GPU 4: 1 × 4g.139gb
# GPU 4: 1 × 4g.135gb
sudo nvidia-smi mig -cgi 5 -C -i 4
# GPU 5: 1 × 7g.278gb (full GPU as a single MIG instance)
# GPU 5: 1 × 7g.269gb (full GPU as a single MIG instance)
sudo nvidia-smi mig -cgi 0 -C -i 5
```
You can choose any valid combination of profile IDs per GPU that fits within the GB300's capacity; the above is a known-good example.
You can choose any valid combination of profile IDs per GPU that fits within the B300's capacity; the above is a known-good example. If your DGX Station has fewer than 6 GPUs, run only the `-i <N>` commands for GPUs that exist (e.g. 0 and 1 only).
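
Typing long repeated ID lists by hand is error-prone. The following is a small sketch that builds the comma-separated argument for `-cgi` and prints the resulting commands for review rather than running them. The helper names `repeat_profile` and `print_cgi` are hypothetical; the profile IDs and counts are the ones from the example layout.

```bash
#!/usr/bin/env bash
# Sketch: build the comma-separated profile list for `nvidia-smi mig -cgi`.

# repeat_profile ID COUNT: print ID repeated COUNT times, joined by commas.
repeat_profile() {
  local id="$1" count="$2" out="" i=0
  while [ "$i" -lt "$count" ]; do
    out+="${out:+,}${id}"
    i=$((i + 1))
  done
  printf '%s\n' "$out"
}

# print_cgi GPU_INDEX PROFILE_ID COUNT: print (do not run) the create command.
print_cgi() {
  echo "sudo nvidia-smi mig -cgi $(repeat_profile "$2" "$3") -C -i $1"
}

# Dry run for the first three GPUs of the example layout.
print_cgi 0 19 7
print_cgi 1 15 4
print_cgi 2 14 3
```

The first printed line matches the GPU 0 command in the example. Once the output looks right, the lines can be copied into a shell and run.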
# Step 5. Verify MIG instances
@ -210,23 +197,23 @@ spec:
nvidia-smi -L
```
You should see each physical GPU (e.g. **NVIDIA GB300**) followed by its MIG devices, for example:
You should see each physical **NVIDIA B300 SXM6 AC** followed by its MIG devices, for example:
```
GPU 0: NVIDIA GB300 (UUID: GPU-...)
MIG 1g.35gb Device 0: (UUID: MIG-...)
MIG 1g.35gb Device 1: (UUID: MIG-...)
GPU 0: NVIDIA B300 SXM6 AC (UUID: GPU-...)
MIG 1g.34gb Device 0: (UUID: MIG-...)
MIG 1g.34gb Device 1: (UUID: MIG-...)
...
GPU 1: NVIDIA GB300 (UUID: GPU-...)
MIG 1g.70gb Device 0: (UUID: MIG-...)
GPU 1: NVIDIA B300 SXM6 AC (UUID: GPU-...)
MIG 1g.67gb Device 0: (UUID: MIG-...)
...
```
To list GPU instances and compute instances (requires sudo):
To list GPU instances and compute instances:
```bash
sudo nvidia-smi mig -lgi # list GPU instances
sudo nvidia-smi mig -lci # list compute instances
nvidia-smi mig -lgi # list GPU instances
nvidia-smi mig -lci # list compute instances
```
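
The MIG UUIDs printed by `nvidia-smi -L` are what later steps pass to `CUDA_VISIBLE_DEVICES`. The following is a minimal sketch of a filter that extracts them, assuming the output has the `(UUID: MIG-...)` shape shown in the listing above. The function name `mig_uuids` and the sample UUIDs are placeholders, not real identifiers.

```bash
#!/usr/bin/env bash
# Sketch: pull MIG UUIDs out of `nvidia-smi -L` style output.

# mig_uuids: read `nvidia-smi -L` output on stdin, print one MIG UUID per line.
# Lines for physical GPUs (UUID: GPU-...) are skipped.
mig_uuids() {
  sed -n 's/.*(UUID: \(MIG-[^)]*\)).*/\1/p'
}

# Hypothetical sample in the same shape as the listing; UUIDs are made up.
sample='GPU 0: NVIDIA GB300 (UUID: GPU-aaaa)
  MIG 1g.35gb Device 0: (UUID: MIG-1111)
  MIG 1g.35gb Device 1: (UUID: MIG-2222)'

printf '%s\n' "$sample" | mig_uuids
```

On a system with MIG instances, the same filter could be fed live output, for example `export CUDA_VISIBLE_DEVICES="$(nvidia-smi -L | mig_uuids | head -n 1)"` to select the first MIG device.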
# Step 6. Using the MIG devices
@ -238,60 +225,20 @@ spec:
./your_app
```
**Verify a MIG instance is visible:** From the same shell where you set `CUDA_VISIBLE_DEVICES`, run `nvidia-smi`. You should see only the single MIG device (e.g. one "MIG 1g.35gb" device). Example:
```bash
export CUDA_VISIBLE_DEVICES=MIG-<uuid-from-nvidia-smi-L>
nvidia-smi
```
**Containers (Docker):** Use the MIG device UUID in the `--gpus` option. Example:
```bash
docker run --gpus '"device=MIG-<uuid>"' nvcr.io/nvidia/cuda:13.0.1-devel-ubuntu24.04 nvidia-smi
```
Replace `<uuid>` with a full MIG UUID from `nvidia-smi -L`. For Kubernetes and nvidia-container-toolkit workflows, see the [MIG User Guide](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/) (Getting Started with MIG and Kubernetes sections).
**Containers and Kubernetes:** use the NVIDIA MIG User Guide “Getting Started with MIG” and the Kubernetes sections. They cover the nvidia-container-toolkit, device plugin, and nvidia-mig-manager workflows for exposing MIG instances to containers.
# Step 7. Disabling MIG and restoring full GPU
When you need full NVLink P2P and a single full-GPU instance again, you must **destroy all MIG instances first**, then disable MIG. If you run `sudo nvidia-smi -mig 0` without destroying instances, it will fail with "In use by another client."
**1. Destroy compute instances and GPU instances on each GPU.** For each GPU index that has MIG instances, run (replace `N` with the GPU index, e.g. 0, 1, … 5 for a 6-GPU system):
```bash
# Destroy all compute instances on GPU N (required before destroying GPU instances)
sudo nvidia-smi mig -dci -i N
# Destroy all GPU instances on GPU N
sudo nvidia-smi mig -dgi -i N
```
Repeat for every GPU that has MIG instances. Example for a 6-GPU system:
```bash
for i in 0 1 2 3 4 5; do sudo nvidia-smi mig -dci -i $i; sudo nvidia-smi mig -dgi -i $i; done
```
**2. Disable MIG mode on all GPUs:**
When you need full NVLink P2P and a single full-GPU instance again, disable MIG on all GPUs:
> [!WARNING]
> This returns each GB300 to a single full-GPU instance. Any workloads using MIG UUIDs must be stopped first and will need to be reconfigured or restarted.
> This removes all MIG instances and returns each B300 to a single full-GPU instance. Any workloads using MIG UUIDs will need to be reconfigured or restarted.
```bash
sudo nvidia-smi -mig 0
```
**3. Verify MIG is fully disabled:**
```bash
nvidia-smi -q | grep -A2 "MIG Mode"
```
Expected output should show `Current: Disabled` for each GPU.
On DGX/HGX B200/B300, ensure **Fabric Manager** is running after disabling MIG so NVLinks and NVSwitch fabric are re-initialized (see [Troubleshooting](troubleshooting.md)).
This resets the GPUs. On DGX/HGX B200/B300, ensure **Fabric Manager** is running so that NVLinks and NVSwitch fabric routing are re-initialized after MIG is disabled.
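
Disabling MIG can be rehearsed before touching the GPUs. The following is a sketch of a dry-run wrapper, on the assumption (suggested by the "In use by another client" error) that compute and GPU instances may need to be destroyed before `-mig 0` succeeds. The `run` and `mig_teardown` helpers and the `APPLY` variable are hypothetical names; the `nvidia-smi` commands are the ones used in this playbook.

```bash
#!/usr/bin/env bash
# Sketch: destroy MIG instances on each GPU, then disable MIG mode.

# run: print the command; execute it only when APPLY=1 is set.
run() {
  echo "+ $*"
  if [ "${APPLY:-0}" = "1" ]; then
    "$@"
  fi
}

# mig_teardown GPU_INDEX...: compute instances go before GPU instances,
# and MIG mode is disabled last.
mig_teardown() {
  local gpu
  for gpu in "$@"; do
    run sudo nvidia-smi mig -dci -i "$gpu"
    run sudo nvidia-smi mig -dgi -i "$gpu"
  done
  run sudo nvidia-smi -mig 0
}

# Dry run for a two-GPU system (adjust the indices to match `nvidia-smi -L`).
mig_teardown 0 1
```

Without `APPLY=1` the script only prints the commands, which makes it easy to confirm the GPU indices first. Setting `APPLY=1` in the environment executes them in the same order.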
@ -302,50 +249,12 @@ spec:
content: |
| Symptom | Cause | Fix |
|--------|--------|-----|
| `nvidia-smi -mig 1` fails or "MIG mode not supported" | Driver too old or GPU not MIG-capable | Use a driver version that supports MIG on GB300 (see [MIG User Guide](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/) for supported versions). Check `nvidia-smi -q` for driver and GPU model. Update the driver if it is too old. |
| "In use by another client" when running `-mig 1`, `-cgi`, or `-mig 0` | GPU is held by another process or MIG instances still exist | **For enable/create:** Stop all GPU processes (desktop, vLLM, nvsm_core, nvidia-pe, nv-hostengine, etc.). Run `sudo fuser -v /dev/nvidia*` to see what is using the GPUs; stop those processes and retry. **For disable:** You must destroy all MIG instances first: run `sudo nvidia-smi mig -dci -i N` then `sudo nvidia-smi mig -dgi -i N` for each GPU index N that has instances, then run `sudo nvidia-smi -mig 0`. |
| `nvidia-smi mig -cgi ... -C -i N` fails (e.g. "Invalid combination") | Profile combination exceeds GPU capacity or invalid IDs | Run `nvidia-smi mig -lgip -i N` and use only listed profile IDs. Ensure the sum of instance sizes does not exceed the GB300's capacity for that GPU. |
| MIG instances not visible after creation | Instances not created or wrong GPU index | Run `nvidia-smi -L` and `sudo nvidia-smi mig -lgi` to confirm. Re-run the `-cgi` commands for the correct `-i <gpu_index>`. |
| App doesn't see MIG device when using CUDA_VISIBLE_DEVICES=MIG-&lt;uuid&gt; | Wrong UUID or app not using CUDA_VISIBLE_DEVICES | Get UUIDs from `nvidia-smi -L`. Export `CUDA_VISIBLE_DEVICES=MIG-<uuid>` in the same shell before launching the app. |
| "Insufficient Permissions" when running `nvidia-smi mig -lgi` or `-lci` | Listing instances requires root | Use `sudo nvidia-smi mig -lgi` and `sudo nvidia-smi mig -lci`. |
| After `nvidia-smi -mig 0`, NVLink or fabric issues on DGX/HGX | Fabric Manager not re-initializing | Ensure Fabric Manager is running after disabling MIG: `sudo systemctl status nvidia-fabricmanager`; start if needed with `sudo systemctl start nvidia-fabricmanager`. See [MIG User Guide](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/) for details. |
| Permission denied when running nvidia-smi -mig or mig -cgi | Need root for MIG operations | Use `sudo` for `nvidia-smi -mig 1/0`, `nvidia-smi mig -cgi ... -C`, `-dci`, and `-dgi`. |
## MIG reconfiguration (day-2 operations)
To change the MIG layout (e.g. add or remove instances, or switch profiles), destroy existing instances on the affected GPU(s), then create the new layout:
1. **Destroy compute instances and GPU instances** on each GPU you want to reconfigure (replace `N` with the GPU index):
```bash
sudo nvidia-smi mig -dci -i N
sudo nvidia-smi mig -dgi -i N
```
2. **Create the new layout** with `sudo nvidia-smi mig -cgi <profile_ids> -C -i N` as in the Instructions (Step 4).
Workloads using the old MIG UUIDs must be stopped before destroying instances; they will need to be restarted with the new UUIDs from `nvidia-smi -L` after recreation.
## Profile selection guidance
| Profile (typical name) | Use case |
|------------------------|----------|
| 1g.35gb (ID 19) | Small inference, dev/test, many concurrent small jobs |
| 1g.70gb (ID 15) | Slightly larger inference or light training |
| 2g.70gb (ID 14) | Medium inference or small training |
| 3g.139gb (ID 9) | Larger inference or medium training |
| 4g.139gb (ID 5) | Heavy inference or moderate training |
| 7g.278gb (ID 0) | Full-GPU as single MIG instance; max memory per partition |
Exact profile names may vary by driver (e.g. 1g.34gb vs 1g.35gb); use the **profile IDs** from `nvidia-smi mig -lgip -i 0` in your `-cgi` commands.
## Post-disable verification
After running `sudo nvidia-smi -mig 0`, confirm MIG is fully disabled:
```bash
nvidia-smi -q | grep -A2 "MIG Mode"
```
Expected output should show `Current: Disabled` for each GPU. If you still see MIG devices in `nvidia-smi -L`, destroy any remaining instances with `-dci`/`-dgi` per GPU, then run `-mig 0` again.
| `nvidia-smi -mig 1` fails or "MIG mode not supported" | Driver too old or GPU not MIG-capable | Ensure you have a B300 (or other MIG-capable GPU) and a driver version that supports MIG on B300. Check `nvidia-smi -q` and [MIG User Guide](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/) for supported hardware/driver. |
| `nvidia-smi mig -cgi ... -C -i N` fails (e.g. "Invalid combination") | Profile combination exceeds GPU capacity or invalid IDs | Run `nvidia-smi mig -lgip -i N` and use only listed profile IDs. Ensure the sum of instance sizes does not exceed the B300's capacity for that GPU. |
| MIG instances not visible after creation | Instances not created or wrong GPU index | Run `nvidia-smi -L` and `nvidia-smi mig -lgi` to confirm. Re-run the `-cgi` commands for the correct `-i <gpu_index>`. |
| App doesn't see MIG device when using CUDA_VISIBLE_DEVICES=MIG-&lt;uuid&gt; | Wrong UUID or app not using CUDA_VISIBLE_DEVICES | Get UUIDs from `nvidia-smi -L`. Export `CUDA_VISIBLE_DEVICES=MIG-<uuid>` in the same shell before launching the app. |
| After `nvidia-smi -mig 0`, NVLink or fabric issues on DGX/HGX | Fabric Manager not re-initializing | On DGX/HGX B200/B300, ensure Fabric Manager is running after disabling MIG so NVLinks and NVSwitch fabric are re-initialized. |
| Permission denied when running nvidia-smi -mig or mig -cgi | Need root for MIG operations | Use `sudo` for `nvidia-smi -mig 1/0` and `nvidia-smi mig -cgi ... -C`. |


@ -82,7 +82,6 @@ spec:
df -h .
```
# Time & risk
* **Estimated duration**: 45-90 minutes depending on network speed and model size
@ -91,8 +90,6 @@ spec:
* Quantization process is memory-intensive and may fail on systems with insufficient GPU memory
* Output files are large (several GB) and require adequate storage space
* **Rollback**: Remove the output directory and any pulled Docker images to restore original state.
* **Last Updated:** 03/02/2026
* First Publication
@ -164,7 +161,7 @@ spec:
In this example, the GB300 is device **1**. Note this number for use in Docker commands.
> [!NOTE]
> The examples below assume the GB300 is device 1. If your GPU has a different ID, adjust the `--gpus "device=X"` parameter in the Docker commands accordingly.
> The examples below assume the GB300 is device 1. If your GPU has a different ID, adjust the `--gpus '"device=X"'` parameter in the Docker commands accordingly.
# Step 5. Run the quantization process using TensorRT Model Optimizer
@ -194,7 +191,7 @@ spec:
This command:
- Runs the container with access to the specified GPU (device 1) and optimized shared memory settings
- Runs the container with full GPU access and optimized shared memory settings
- Mounts your output directory to persist quantized model files
- Mounts your Hugging Face cache to avoid re-downloading the model
- Clones and installs the TensorRT Model Optimizer from source
@ -232,7 +229,7 @@ spec:
-e HF_TOKEN=$HF_TOKEN \
-v "$MODEL_PATH:/workspace/model" \
--rm -it --ulimit memlock=-1 --ulimit stack=67108864 \
--gpus "device=1" --ipc=host --network host \
--gpus '"device=1"' --ipc=host --network host \
nvcr.io/nvidia/vllm:25.12.post1-py3 \
vllm serve /workspace/model \
--max-model-len 4096 \
@ -255,7 +252,7 @@ spec:
-e HF_TOKEN=$HF_TOKEN \
-v "$MODEL_PATH:/workspace/model" \
--rm -it --ulimit memlock=-1 --ulimit stack=67108864 \
--gpus "device=1" --ipc=host --network host \
--gpus '"device=1"' --ipc=host --network host \
nvcr.io/nvidia/vllm:25.12.post1-py3 \
vllm serve /workspace/model \
--backend pytorch \
@ -264,13 +261,13 @@ spec:
--port 8000
```
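
Before sending a chat request, it can help to confirm which model name the server exposes. The following is a rough sketch that reads the first `id` value from an OpenAI-style `/v1/models` response. It is a text filter rather than a JSON parser, the function name `first_model_id` is hypothetical, and the sample body is trimmed and illustrative; a JSON-aware tool would be more robust.

```bash
#!/usr/bin/env bash
# Sketch: read the first model id from an OpenAI-style /v1/models response.

# first_model_id: read JSON on stdin, print the first "id" value found.
# Assumes the model entry's "id" key appears before any nested "id" keys.
first_model_id() {
  grep -o '"id": *"[^"]*"' | head -n 1 | sed 's/.*"\([^"]*\)"$/\1/'
}

# Hypothetical response body, trimmed to the fields that matter here.
sample='{"object":"list","data":[{"id":"model","object":"model"}]}'
printf '%s\n' "$sample" | first_model_id
```

Against a running server, the same filter could be used as `MODEL="$(curl -s http://localhost:8000/v1/models | first_model_id)"`, and `$MODEL` then placed in the `"model"` field of the request body.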
When serving from a local path, vLLM exposes the model name as the path's last component (here, `model`). Run the following to test the server (use the same model name vLLM reports, e.g. from `curl http://localhost:8000/v1/models`):
Run the following to test the server with a `curl` request:
```bash
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "model",
"model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
"messages": [{"role": "user", "content": "What is artificial intelligence?"}],
"max_tokens": 100,
"temperature": 0.7,


@ -32,7 +32,7 @@ spec:
cta:
text: View on GitHub
url: https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/station-topic-modeling/
url: https://github.com/NVIDIA/dgx-station-playbooks/blob/main/nvidia/station-topic-modeling/
tabs:
@ -80,11 +80,10 @@ spec:
# Ancillary files
All required assets are in the playbook directory `nvidia/station-topic-modeling/assets` (see [Instructions](https://build.nvidia.com/station/topic-modeling/instructions), Step 7). Key file:
All required assets are in the playbook directory `nvidia/station-topic-modeling/assets` (see Step 7). Key file:
- `video_notebook_for_GPU_Accelerated_Machine_Learning_BERTopic_RTX6000_40M.ipynb` - Complete Jupyter notebook with GPU-accelerated topic modeling pipeline (filename reflects original demo hardware; the notebook runs on GB300 and other NVIDIA GPUs)
# Time & risk
* **Estimated time:** 45 minutes (includes environment setup, dataset download, and embedding generation)
@ -92,7 +91,7 @@ spec:
* Large dataset download (~14GB) may take time depending on network speed
* Embedding generation requires significant GPU memory
* **Rollback:** Delete the downloaded dataset and any generated embedding files to restore state
* **Last Updated:** 03/02/2026
* **Last Updated:** 02/05/2026
* First Publication
@ -135,7 +134,6 @@ spec:
# Step 4. Install machine learning packages
Install UMAP, HDBSCAN, BERTopic, and supporting libraries for topic modeling.
Note: `datamapplot` will upgrade dask/distributed — the next step pins them back.
```bash
pip install \
@ -144,15 +142,7 @@ spec:
scikit-learn==1.4.2 datamapplot
```
Pin dask/distributed back to RAPIDS-compatible versions:
```bash
pip install "dask==2025.9.1" "distributed==2025.9.1"
```
These packages provide:
- **dask**: Parallel computing library
- **distributed**: Distributed task scheduler for dask
- **sentence-transformers**: Generate text embeddings
- **umap-learn / hdbscan**: Dimensionality reduction and clustering (GPU-accelerated via cuML)
- **bertopic**: Topic modeling framework
@ -196,8 +186,8 @@ spec:
Clone the playbook repository and download the Amazon Electronics Reviews dataset.
```bash
git clone https://github.com/NVIDIA/dgx-spark-playbooks
cd dgx-spark-playbooks/nvidia/station-topic-modeling/assets
git clone https://github.com/NVIDIA/dgx-station-playbooks
cd dgx-station-playbooks/nvidia/station-topic-modeling/assets
```
Download the dataset (~14GB compressed):
@ -206,17 +196,7 @@ spec:
wget https://mcauleylab.ucsd.edu/public_datasets/data/amazon_2023/raw/review_categories/Electronics.jsonl.gz
```
# Step 8. Pull Git LFS files (notebooks)
The notebook files are stored in Git LFS — without this step, JupyterLab will throw a `NotJSONError` when trying to open them.
```bash
conda install -c conda-forge git-lfs
git lfs install
git lfs pull
```
# Step 9. Launch JupyterLab
# Step 8. Launch JupyterLab
Start JupyterLab from the assets directory:
@ -224,13 +204,13 @@ spec:
jupyter lab
```
# Step 10. Select the rapids-25.10 kernel
# Step 9. Select the rapids-25.10 kernel
In JupyterLab, open the notebook `video_notebook_for_GPU_Accelerated_Machine_Learning_BERTopic_1M.ipynb`.
Select the **rapids-25.10** kernel from the kernel selector in the top right corner of the notebook interface.
# Step 11. Execute all cells
# Step 10. Execute all cells
Run all cells in the notebook sequentially. The notebook will:
@ -241,7 +221,7 @@ spec:
5. **Run BERTopic**: Cluster documents into topics using GPU-accelerated UMAP and HDBSCAN
6. **Visualize results**: Generate interactive topic visualizations
# Step 12. Explore the results
# Step 11. Explore the results
After the notebook completes, you'll have:
@ -251,7 +231,7 @@ spec:
- **Heatmap**: Topic similarity matrix
- **Document datamap**: Visual clustering of documents by topic
# Step 13. Cleanup (optional)
# Step 12. Cleanup (optional)
Remove the conda environment when finished:
@ -266,16 +246,6 @@ spec:
rm Electronics.jsonl.gz
```
Remove generated embedding files and the cloned playbook directory if you no longer need them:
```bash
# Optional: remove Hugging Face cache (embedding cache from the notebook)
rm -rf ~/.cache/huggingface
# From the parent of dgx-spark-playbooks/, remove the cloned repo
rm -rf dgx-spark-playbooks/
```
# Next steps
Apply this workflow to your own datasets:
@ -288,6 +258,31 @@ spec:
-
id: troubleshooting
label: Troubleshooting
content: |
# Common issues
| Symptom | Cause | Fix |
|---------|-------|-----|
| "Permission denied" on `~/.cache/huggingface` or Hugging Face download fails | Cache dir owned by root or wrong permissions | Run `sudo chown -R $USER:$USER $HOME/.cache/huggingface` and `sudo chmod -R u+rwX $HOME/.cache/huggingface` (use your username if different). |
| `PackagesNotFoundError` for `jupyterlab-widgets` with conda | Package not available for platform/channel | Install with pip: `pip install jupyterlab-widgets`. |
| Pip reports dependency conflicts (dask, distributed, cuml, rapids-dask-dependency) after installing BERTopic stack | Pip downgrades dask/distributed; RAPIDS expects newer versions | BERTopic and the notebook typically still work. To avoid conflicts, install BERTopic/umap/hdbscan in a separate env, or accept the conflict if you do not need cuML + dask together. |
| `CUDA out of memory` error during embedding generation | Insufficient GPU memory for batch size | Reduce batch size in `model.encode()` or process fewer documents by lowering `nrows` |
| `ModuleNotFoundError: No module named 'cuml'` | cuML not installed or wrong environment | Verify `conda activate rapids-25.10` and run `%load_ext cuml.accel` before imports |
| Notebook kernel dies during UMAP | Out of memory during dimensionality reduction | Reduce dataset size or use `low_memory=True` in UMAP parameters |
| `wget` download fails or hangs | Network issues or firewall blocking | Check internet connection, try with `--retry-connrefused --waitretry=1 --read-timeout=20` |
| Kernel not found in JupyterLab | rapids-25.10 kernel not registered | Run `python -m ipykernel install --user --name rapids-25.10` |
| `cudf.pandas` not accelerating operations | Extension not loaded before pandas import | Restart kernel and ensure `%load_ext cudf.pandas` runs before `import pandas` |
| Topic model produces too many/few topics | HDBSCAN parameters need tuning | Adjust `min_cluster_size` (larger = fewer topics) and `min_samples` |
| Plotly visualizations not rendering | Renderer not configured for JupyterLab | Add `pio.renderers.default = "notebook"` after importing plotly |
| `ResolvePackageNotFound` during conda install | Package version conflict or missing channel | Ensure `-c rapidsai -c conda-forge` channels are specified |
| PyTorch not using GPU | Wrong PyTorch version or CUDA mismatch | Reinstall with correct CUDA version: `pip install torch==2.9.0 --index-url https://download.pytorch.org/whl/cu130` |
resources:
- name: BERTopic Documentation


@ -35,7 +35,7 @@ spec:
cta:
text: View on GitHub
url: https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/station-txt2kg/
url: https://github.com/NVIDIA/dgx-station-playbooks/blob/main/nvidia/station-txt2kg/
tabs:
@ -81,13 +81,12 @@ spec:
# Ancillary files
All required assets are in the playbook directory `nvidia/station-txt2kg/assets` (see Instructions, Step 1). Key files:
All required assets are in the playbook directory `nvidia/station-txt2kg/assets` (see Step 1). Key files:
- `start.sh` - Launch script for all services
- `stop.sh` - Stop script to shut down services
- `deploy/compose/` - Docker Compose configurations
# Time & risk
- **Duration**:
@ -100,8 +99,8 @@ spec:
- Document processing time scales with document size and complexity
- **Rollback**: Stop and remove Docker containers, delete downloaded models if needed
* **Last Updated:** 03/02/2026
* First Publication
- **Last Updated**: 02/06/2026
- First Publication
@ -115,61 +114,63 @@ spec:
This playbook is for **DGX Station**. In a terminal, clone the repository and navigate to the project directory.
```bash
git clone https://github.com/NVIDIA/dgx-spark-playbooks
cd dgx-spark-playbooks/nvidia/station-txt2kg/assets
git clone https://github.com/NVIDIA/dgx-station-playbooks
cd dgx-station-playbooks/nvidia/station-txt2kg/assets
```
# Step 2. Start the txt2kg services
The default backend is **vLLM** (supported on DGX Station). The script starts services and waits for the vLLM backend to be ready (model load can take 30+ minutes; progress is shown in the terminal). To use Ollama instead, run `./start.sh --ollama`.
Use the provided start script to launch all required services. On DGX Station, if the default backend (Ollama) does not work, use the vLLM backend: `./start.sh --vllm`.
```bash
./start.sh
# Optional: ./start.sh --ollama # Use ArangoDB + Ollama instead of vLLM
# Optional: ./start.sh --no-wait # Skip waiting for vLLM readiness
# If the default backend fails: ./start.sh --vllm
```
The script will:
The script will automatically:
- Check for GPU availability
- Start Docker Compose services (Neo4j + vLLM by default)
- Wait for vLLM to be ready and show elapsed time
- Print the Web UI URL when ready
- Start Docker Compose services
- Set up ArangoDB database
- Launch the web interface
# Step 3. Pull the model (Ollama only)
# Step 3. Pull the Llama 3.1 405B model
If you started with **Ollama** (`./start.sh --ollama`), pull the Llama model:
The default configuration uses Llama 3.1 405B, which leverages the GB300 Ultra's large GPU memory for maximum accuracy in knowledge extraction:
```bash
docker exec ollama-compose ollama pull llama3.1:405b
```
Browse available models at [https://ollama.com/search](https://ollama.com/search). With the default **vLLM** stack, the model is loaded automatically by the vLLM container.
Browse available models at [https://ollama.com/search](https://ollama.com/search)
> [!NOTE]
> The first model download may take 20-30 minutes depending on network speed. For faster initial testing, you can use `llama3.1:70b` or `llama3.1:8b` as alternatives.
# Step 4. Access the web interface
> [!NOTE]
> If you started with **vLLM** (`./start.sh --vllm`), the vLLM backend can take **30 minutes or more** to load the model and initialize. There may be no progress indicator in the CLI or web UI during this time; check container logs with `docker logs` to confirm the server is still loading.
Open your browser and navigate to:
```
http://localhost:3001
```
You can also access:
- **Neo4j Browser** (vLLM default): http://localhost:7474
- **vLLM API**: http://localhost:8001
- **ArangoDB** (Ollama only): http://localhost:8529
- **Ollama API** (Ollama only): http://localhost:11434
You can also access individual services:
- **ArangoDB Web Interface**: http://localhost:8529
- **Ollama API**: http://localhost:11434
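Because the vLLM backend can take 30 minutes or more to load and may show no progress indicator, a small polling loop can make the wait less opaque. The following is a minimal sketch and is not part of the playbook's scripts: `wait_for_http` is a hypothetical helper, and the `/health` path assumes the vLLM OpenAI-compatible server exposes its health endpoint on the vLLM API port listed above (8001).

```bash
#!/usr/bin/env bash
# Hypothetical helper: poll a URL until it answers successfully or a timeout elapses.
# Arguments: URL, timeout in seconds (default 1800), poll interval in seconds (default 10).
wait_for_http() {
  local url="$1" timeout="${2:-1800}" interval="${3:-10}"
  local start=$SECONDS
  # --fail makes curl exit non-zero on HTTP errors (4xx/5xx) as well as on connection errors.
  until curl --silent --fail --output /dev/null --max-time 5 "$url"; do
    if (( SECONDS - start >= timeout )); then
      echo "Timed out after ${timeout}s waiting for ${url}" >&2
      return 1
    fi
    sleep "$interval"
  done
  echo "Ready after $(( SECONDS - start ))s: ${url}"
}

# Example (commented out so sourcing this file does not block):
# wait_for_http "http://localhost:8001/health" 2400 15   # assumed vLLM health endpoint
# wait_for_http "http://localhost:3001"                  # web UI
```

If the loop times out, the container logs remain the authoritative source for what the backend is doing.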
# Step 5. Upload documents and build knowledge graphs
The web UI defaults to **local** (vLLM or Ollama). If the backend is still loading, a banner and the model selector will show “Initializing…” until the backend is ready.
### 5.1. Document Upload
- Use the web interface to upload text documents (markdown, text, CSV supported)
- Documents are automatically chunked and processed for triple extraction
### 5.2. Knowledge Graph Generation
- The system extracts subject-predicate-object triples using the selected LLM (vLLM or Ollama)
- Triples are stored in Neo4j (vLLM) or ArangoDB (Ollama) for relationship querying
- The system extracts subject-predicate-object triples using Ollama
- Triples are stored in ArangoDB for relationship querying
### 5.3. Interactive Visualization
- View your knowledge graph in 2D or 3D with GPU-accelerated rendering
@ -184,28 +185,26 @@ spec:
# Step 6. Cleanup and rollback
Stop all services (use the same flags as when you started):
Remove downloaded models while the container is still running, then stop services:
```bash
# Stop services (default: vLLM stack)
./stop.sh
# If you started with Ollama: ./stop.sh --ollama
# Remove downloaded models (optional; run before stopping containers)
docker exec ollama-compose ollama rm llama3.1:405b
# Stop services
docker compose down
# Remove containers and volumes (optional)
# From assets dir: docker compose -f deploy/compose/docker-compose.vllm.yml down -v
# Or with Ollama: docker compose -f deploy/compose/docker-compose.yml down -v
# Remove downloaded Ollama models (Ollama only)
# docker exec ollama-compose ollama rm llama3.1:405b
docker compose down -v
```
# Step 7. Next steps
- Default is vLLM on DGX Station; use `./start.sh --ollama` for ArangoDB + Ollama.
- The UI shows a readiness banner and “vLLM (Local) Initializing…” until the backend is ready.
- Experiment with different models for extraction quality and speed tradeoffs.
- Customize triple extraction prompts for domain-specific knowledge.
- Explore advanced graph querying and visualization features.
- On DGX Station, use `./start.sh --vllm` if the default Ollama backend does not work; allow 30+ minutes for vLLM to initialize.
- Experiment with different Ollama models for varied extraction quality and speed tradeoffs
- The 405B model provides the highest accuracy; use 70B or 8B for faster processing
- Customize triple extraction prompts for domain-specific knowledge
- Explore advanced graph querying and visualization features
@ -224,8 +223,8 @@ spec:
| ArangoDB connection refused | Service not fully started | Wait 30s after start.sh, verify with `docker ps` |
| Container fails to start with GPU error | NVIDIA Container Toolkit not configured | Run `nvidia-ctk runtime configure --runtime=docker` and restart Docker |
| Port already in use | Previous instance still running | Run `./stop.sh` first or use `docker compose down` |
| Default is vLLM; need Ollama instead | Prefer ArangoDB + Ollama | Start with `./start.sh --ollama`. |
| vLLM takes long to become ready | Model load can take 30+ minutes | The start script waits and shows elapsed time. The UI shows a banner and "vLLM (Local) Initializing…" until ready. Check progress: `docker logs vllm-service -f`. |
| Default backend (Ollama) doesn't work on DGX Station | Backend or model not available | Start with vLLM: `./start.sh --vllm`. Allow 30+ minutes for vLLM to load the model; there may be no progress message in the UI. |
| No feedback while vLLM is starting | vLLM model load takes a long time | vLLM can take >30 minutes to initialize. Check `docker logs` for the vLLM container to confirm it is still loading. |
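For the "Port already in use" row, it can help to see which of the playbook's ports are occupied before restarting. This is a rough sketch under stated assumptions: `port_in_use` and `check_playbook_ports` are hypothetical helpers, the port list is copied from the service URLs in this playbook, and the check relies on bash's built-in `/dev/tcp` redirection, so it needs bash rather than plain `sh`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: exit 0 if something accepts TCP connections on host:port.
# The connection is opened in a subshell so the file descriptor closes automatically.
port_in_use() {
  local host="$1" port="$2"
  (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null
}

# Report the state of each port used by the services named in this playbook.
check_playbook_ports() {
  local port
  for port in 3001 7474 8001 8529 11434; do
    if port_in_use 127.0.0.1 "$port"; then
      echo "port ${port}: in use"
    else
      echo "port ${port}: free"
    fi
  done
}

# Example (commented out so sourcing this file prints nothing):
# check_playbook_ports
```

A port reported as in use after `./stop.sh` suggests a leftover container or an unrelated process; `docker ps` shows which containers are still running.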
> [!NOTE]
> DGX Station with GB300 Ultra provides massive GPU memory capacity, enabling you to run larger models (70B+)


@ -45,6 +45,8 @@ The following models are supported with vLLM on DGX Station. All listed models a
| Model | Quantization | Support Status | HF Handle |
|-------|-------------|----------------|-----------|
| **DiffusionGemma 26B A4B IT** | BF16 | ✅ | [`google/diffusiongemma-26B-A4B-it`](https://huggingface.co/google/diffusiongemma-26B-A4B-it) |
| **DiffusionGemma 26B A4B IT** | NVFP4 | ✅ | [`nvidia/diffusiongemma-26B-A4B-it-NVFP4`](https://huggingface.co/nvidia/diffusiongemma-26B-A4B-it-NVFP4) |
| **Step-3.7-Flash-FP8** | FP8 | ✅ | [`stepfun-ai/Step-3.7-Flash-FP8`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-FP8) |
| **Step-3.7-Flash-NVFP4** | NVFP4 | ✅ | [`stepfun-ai/Step-3.7-Flash-NVFP4`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-NVFP4) |
| **Qwen3-235B-A22B-NVFP4** | NVFP4 | ✅ | [`nvidia/Qwen3-235B-A22B-NVFP4`](https://huggingface.co/nvidia/Qwen3-235B-A22B-NVFP4) |
@ -54,7 +56,7 @@ The following models are supported with vLLM on DGX Station. All listed models a
* **Duration:** 30 minutes (longer on first run due to model download)
* **Risks:** Model download requires HuggingFace authentication
* **Rollback:** Stop and remove the container to restore state
* **Last Updated:** 05/28/2026
* **Last Updated:** 06/10/2026
* Update models
## Instructions
@ -92,6 +94,12 @@ Pull the vLLM container from NGC. Use the **26.01** image on DGX Station; the 25
docker pull nvcr.io/nvidia/vllm:26.01-py3
```
For DiffusionGemma, use the vLLM custom container:
```bash
docker pull vllm/vllm-openai:gemma
```
For Step-3.7-Flash models, pull the custom vLLM container:
```bash
docker pull vllm/vllm-openai:stepfun37
@ -119,6 +127,34 @@ docker run -d \
--gpu-memory-utilization 0.9
```
For DiffusionGemma models (e.g. `google/diffusiongemma-26B-A4B-it`), run with the custom vLLM container:
```bash
docker run -d \
--name vllm-server \
-p 8000:8000 \
--gpus all \
--shm-size=16g \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-e VLLM_USE_V2_MODEL_RUNNER=1 \
-e HF_TOKEN="$HF_TOKEN" \
vllm/vllm-openai:gemma ${MODEL_HANDLE} \
--gpu-memory-utilization 0.85 \
--attention-backend TRITON_ATTN \
--max-num-seqs 16 \
--diffusion-config '{"canvas_length":256}' \
--override-generation-config '{"max_new_tokens": null}' \
--load-format fastsafetensors \
--enable-prefix-caching \
--reasoning-parser gemma4 \
--default-chat-template-kwargs '{"enable_thinking": true}' \
--enable-auto-tool-choice \
--tool-call-parser gemma4
# For BF16 checkpoint add "--moe-backend triton" for better performance
```
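The `docker run` command above expands `${MODEL_HANDLE}` and `$HF_TOKEN` from the calling shell, so an unset variable silently becomes an empty argument. A preflight check along the following lines may catch that earlier. It is a sketch only: `require_env` is a hypothetical helper, not something shipped with the playbook or with vLLM.

```bash
#!/usr/bin/env bash
# Hypothetical preflight helper: report every named variable that is unset or empty.
# Returns 0 if all are set, 1 if at least one is missing.
require_env() {
  local name missing=0
  for name in "$@"; do
    # ${!name:-} is bash indirect expansion: the value of the variable whose name is in $name.
    if [[ -z "${!name:-}" ]]; then
      echo "Missing required environment variable: ${name}" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example (commented out so sourcing this file has no side effects):
# export MODEL_HANDLE="google/diffusiongemma-26B-A4B-it"
# require_env MODEL_HANDLE HF_TOKEN && echo "Environment looks complete"
```

Running the check immediately before `docker run` keeps a missing token or model handle from surfacing later as a container startup failure.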
For Step-3.7-Flash models, run with the custom vLLM container. The FP8 and the NVFP4 versions fit entirely in VRAM on the GB300.
```bash


@ -65,6 +65,8 @@ spec:
| Model | Quantization | Support Status | HF Handle |
|-------|-------------|----------------|-----------|
| **DiffusionGemma 26B A4B IT** | BF16 | ✅ | [`google/diffusiongemma-26B-A4B-it`](https://huggingface.co/google/diffusiongemma-26B-A4B-it) |
| **DiffusionGemma 26B A4B IT** | NVFP4 | ✅ | [`nvidia/diffusiongemma-26B-A4B-it-NVFP4`](https://huggingface.co/nvidia/diffusiongemma-26B-A4B-it-NVFP4) |
| **Step-3.7-Flash-FP8** | FP8 | ✅ | [`stepfun-ai/Step-3.7-Flash-FP8`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-FP8) |
| **Step-3.7-Flash-NVFP4** | NVFP4 | ✅ | [`stepfun-ai/Step-3.7-Flash-NVFP4`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-NVFP4) |
| **Qwen3-235B-A22B-NVFP4** | NVFP4 | ✅ | [`nvidia/Qwen3-235B-A22B-NVFP4`](https://huggingface.co/nvidia/Qwen3-235B-A22B-NVFP4) |
@ -74,7 +76,7 @@ spec:
* **Duration:** 30 minutes (longer on first run due to model download)
* **Risks:** Model download requires HuggingFace authentication
* **Rollback:** Stop and remove the container to restore state
* **Last Updated:** 05/28/2026
* **Last Updated:** 06/10/2026
* Update models
@ -117,6 +119,12 @@ spec:
docker pull nvcr.io/nvidia/vllm:26.01-py3
```
For DiffusionGemma, use the vLLM custom container:
```bash
docker pull vllm/vllm-openai:gemma
```
For Step-3.7-Flash models, pull the custom vLLM container:
```bash
docker pull vllm/vllm-openai:stepfun37
@ -144,6 +152,34 @@ spec:
--gpu-memory-utilization 0.9
```
For DiffusionGemma models (e.g. `google/diffusiongemma-26B-A4B-it`), run with the custom vLLM container:
```bash
docker run -d \
--name vllm-server \
-p 8000:8000 \
--gpus all \
--shm-size=16g \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-e VLLM_USE_V2_MODEL_RUNNER=1 \
-e HF_TOKEN="$HF_TOKEN" \
vllm/vllm-openai:gemma ${MODEL_HANDLE} \
--gpu-memory-utilization 0.85 \
--attention-backend TRITON_ATTN \
--max-num-seqs 16 \
--diffusion-config '{"canvas_length":256}' \
--override-generation-config '{"max_new_tokens": null}' \
--load-format fastsafetensors \
--enable-prefix-caching \
--reasoning-parser gemma4 \
--default-chat-template-kwargs '{"enable_thinking": true}' \
--enable-auto-tool-choice \
--tool-call-parser gemma4
# For BF16 checkpoint add "--moe-backend triton" for better performance
```
For Step-3.7-Flash models, run with the custom vLLM container. The FP8 and the NVFP4 versions fit entirely in VRAM on the GB300.
```bash