chore: Regenerate all playbooks

This commit is contained in:
GitLab CI 2026-06-24 15:31:38 +00:00
parent 797933babb
commit 0c6aab8e63
5 changed files with 332 additions and 204 deletions

View File

@ -209,7 +209,7 @@ spec:
df -h /
```
Expected: Blackwell Ultra GPU, Docker >= 23.0.1, **Node.js v22.x** (the DGX Station ships with v18 — see below), OpenShell >= 0.0.33, and **at least 200 GB free** on `/` (86 GB model + Docker images + working space).
Expected: Blackwell Ultra GPU, Docker >= 23.0.1, **Node.js v22.x**, OpenShell >= 0.0.33, and **at least 200 GB free** on `/` (86 GB model + Docker images + working space).
> [!WARNING]
> If `openshell --version` says `command not found`, the binary is at `~/.local/bin/openshell` but isn't on PATH. Run the `export PATH=...` line above and re-source `~/.bashrc`. Without this, every `openshell` and `make` command in later steps fails.
@ -217,10 +217,14 @@ spec:
> [!TIP]
> `make prereq` (run from `~/clinical-intelligence` after Step 2) bundles all of the checks below — Docker, Node version, OpenShell, disk space, GPU, port 11434, and NGC auth — into one command.
**If `node --version` reports v18.x or older**, install Node.js v22 before continuing:
**If `node --version` reports v18.x, older, or `command not found`**, install Node.js v22 before continuing:
```bash
curl -fsSL https://deb.nodesource.com/setup_22.x | sudo -E bash -
# Download the NodeSource setup script first, then run it with sudo.
# Running it inline with `| sudo bash` does not work — the sudo context
# needs to own the entire script execution.
curl -fsSL https://deb.nodesource.com/setup_22.x -o /tmp/nodesource_setup.sh
sudo bash /tmp/nodesource_setup.sh
sudo apt-get install -y nodejs
node --version # should now show v22.x
```
@ -239,10 +243,20 @@ spec:
Expected: `port 11434 free`. If the line still shows a listener, something else (an old `ollama serve`, another container, etc.) owns the port — stop it, or change `OLLAMA_PORT` in `.env` (Step 2) to a free port such as `11435`. `make setup` sources `.env` and configures the sandbox provider against the override.
**Stale OpenShell gateway?** If you previously ran the NemoClaw playbook, an existing gateway will be silently reused under the new name. To start clean:
**Stale OpenShell gateway?** If you previously ran a playbook that started `openshell-gateway`, kill the process and remove the registration:
```bash
openshell gateway destroy 2>/dev/null || true
pkill -f openshell-gateway 2>/dev/null || true
openshell gateway remove openshell 2>/dev/null || true
```
**Previously ran the NemoClaw playbook?** NemoClaw installs `openclaw-gateway.service` as a systemd user service that binds port 18789. If it is still running, `make setup` fails with "Port 18789 is already in use". Stop and disable it before proceeding — `make setup` will also do this automatically, but stopping it here avoids a confusing error:
```bash
systemctl --user stop openclaw-gateway.service 2>/dev/null || true
systemctl --user disable openclaw-gateway.service 2>/dev/null || true
# Verify the port is free
ss -tlnp | grep 18789 || echo 'port 18789 free'
```
# Step 2. Copy the assets and configure
@ -325,8 +339,8 @@ spec:
Expected:
```
Ollama: ✓ healthy
OpenFold3: ✓ healthy
Ollama (port 11434): ✓ healthy
OpenFold3 (port 8000): ✓ healthy
```
OpenFold3 takes ~3 minutes to load model weights on startup. If it shows "down (may still be loading)", wait and check again.
@ -336,26 +350,30 @@ spec:
# Step 4. Start the OpenShell gateway
The OpenShell gateway runs a lightweight k3s Kubernetes cluster inside Docker to manage sandboxes. On DGX Station, the kernel uses cgroup v2 with the systemd driver, but k3s defaults to cgroupfs. The flag below tells k3s to match the host:
OpenShell >= 0.0.40 ships `openshell-gateway`, a standalone server binary installed alongside the CLI. Start it with the Docker driver (no Kubernetes required), then register it with the CLI:
```bash
OPENSHELL_K3S_ARGS='--kubelet-arg=cgroup-driver=systemd' openshell gateway start
# Start the gateway server in the background using the Docker compute driver.
# --disable-tls is safe for local-only use (loopback-bound).
nohup openshell-gateway \
--disable-tls \
--drivers docker \
--bind-address 127.0.0.1 \
--port 17670 \
> /tmp/openshell-gateway.log 2>&1 &
echo "Gateway PID: $!"
# Register the gateway with the CLI and set it as active.
openshell gateway add http://127.0.0.1:17670 --name openshell
```
Wait for the gateway's embedded k3s cluster to finish initializing (1015 seconds after `gateway start` returns), then verify:
Verify the gateway is connected:
```bash
# Wait until the gateway accepts connections, fail after 60s
for i in $(seq 1 30); do
if openshell status 2>/dev/null | grep -q "Connected"; then
echo "Gateway: Connected"; break
fi
sleep 2
done
openshell status
```
Expected: `Status: Connected`. If the first `openshell status` immediately after `gateway start` reports `Connection reset by peer`, that is normal — k3s is still warming up. The loop above polls until it is ready.
Expected: `Status: Connected`. If not connected, check `/tmp/openshell-gateway.log` for errors. The gateway typically starts in under 1 second.
> [!NOTE]
> Step 4 configures OpenShell infrastructure (gateway). Step 5 deploys the healthcare agent into this infrastructure.
@ -492,7 +510,8 @@ spec:
```bash
openshell sandbox delete clinical-sandbox
make down
openshell gateway destroy
pkill -f openshell-gateway 2>/dev/null || true
openshell gateway remove openshell 2>/dev/null || true
```
To also remove downloaded models and volumes:

View File

@ -1,6 +1,6 @@
# Local Coding Agent
> Run local CLI coding agents with Ollama on DGX Station (NVIDIA GB300) using glm-4.7-flash (fast) or unsloth/GLM-4.7-GGUF:Q8_0 (best quality)
> Run local CLI coding agents with Claude Code and Ollama on DGX Station (NVIDIA GB300) using qwen3.6:27b
## Table of Contents
@ -15,10 +15,10 @@
## Basic idea
Use Ollama on **DGX Station (NVIDIA GB300)** to run local coding models and connect a CLI coding agent. This
playbook uses **Claude Code** to talk to Ollama for local inference, so you can work without external cloud APIs.
Use Ollama on **DGX Station (NVIDIA GB300)** to run a local coding model and connect a CLI coding agent. This
playbook uses **Claude Code** with `ollama launch` so you can work without external cloud APIs.
The DGX Station GPU (reported as **NVIDIA GB300** in `nvidia-smi`) provides ample memory to run **glm-4.7-flash** (fast loading and testing) and larger models such as **unsloth/GLM-4.7-GGUF:Q8_0** (best quality), both supported on Ollama.
The DGX Station GPU (reported as **NVIDIA GB300** in `nvidia-smi`) provides ample memory to run **qwen3.6:27b** with Ollama for local coding-agent workflows.
## CLI agent
@ -26,7 +26,7 @@ This playbook uses **Claude Code** as the CLI agent, connected to a local Ollama
## What you'll accomplish
You will run a local coding model on your **DGX Station (NVIDIA GB300)** with Ollama, connect Claude Code to it, and complete a small coding task end-to-end. Use **glm-4.7-flash** (including high-quality variants) or **unsloth/GLM-4.7-GGUF:Q8_0** for best quality.
You will run **qwen3.6:27b** on your **DGX Station (NVIDIA GB300)** with Ollama, connect Claude Code to it, and complete a small coding task end-to-end.
## What to know before starting
@ -38,12 +38,9 @@ You will run a local coding model on your **DGX Station (NVIDIA GB300)** with Ol
- **DGX Station** with **NVIDIA GB300** (Grace Blackwell) and NVIDIA driver; `nvidia-smi` typically shows "NVIDIA GB300"
- Internet access to download model weights
- **Ollama 0.15.0 or newer** (required for GLM-4.7-Flash; do not pin to 0.14.3)
- **GPU memory** on GB300 supports both recommended models:
- **glm-4.7-flash**: ~19 GB (`latest`) to ~60 GB (bf16) — **recommended for fast loading and testing**
- **unsloth/GLM-4.7-GGUF:Q8_0** (Hugging Face on Ollama): larger model — **recommended for best quality**
- Other variants (e.g. `glm-4.7-flash:bf16`, `glm-4.7-flash:q8_0`) fit on GB300
- **Disk space** for model downloads: plan for ~19 GB for `glm-4.7-flash:latest`, plus additional space for the Q8_0 or bf16 variants if you use them
- **Ollama 0.15.0 or newer**
- **GPU memory** on GB300 supports the recommended `qwen3.6:27b` model
- **Disk space** for the `qwen3.6:27b` model download
## Time & risk
@ -52,8 +49,8 @@ You will run a local coding model on your **DGX Station (NVIDIA GB300)** with Ol
* Large model downloads can fail if network connectivity is unstable
* Older Ollama versions will not load newer models
* **Rollback**: Stop Ollama and delete the downloaded model from `~/.ollama/models`
* **Last Updated:** 03/06/2026
* Model set to glm-4.7-flash; Ollama 0.15.0+; cleanup order and docs refresh
* **Last Updated:** 06/12/2026
* Model path set to qwen3.6:27b with `ollama launch`; Python task now uses a virtual environment
## Claude Code
@ -85,13 +82,13 @@ curl -fsSL https://ollama.com/install.sh | sh
ollama --version
```
To install a specific version (e.g. 0.15.0 or newer, required for GLM-4.7-Flash):
To install a specific version if needed:
```bash
curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.15.0 sh
```
If Ollama is already present and the version is 0.15.0 or newer, simply run:
If Ollama is already present, simply run:
```bash
ollama --version
@ -105,25 +102,12 @@ ollama version is 0.15.0
## Step 3. Pull a coding model
**Description**: Download the model weights to your DGX Station. This playbook supports two model options on Ollama; choose one (or both) depending on whether you want **fast loading and testing** or **best quality**.
**Description**: Download the model weights to your DGX Station.
**For fast loading and testing** — **glm-4.7-flash** (~19 GB for `latest`; loads quickly; ensure Ollama 0.15.0+):
This playbook uses **qwen3.6:27b** with Claude Code through Ollama:
```bash
ollama pull glm-4.7-flash
```
**For best quality** — **unsloth/GLM-4.7-GGUF:Q8_0** from Hugging Face (larger, higher quality; supported on Ollama):
```bash
ollama pull hf.co/unsloth/GLM-4.7-GGUF:Q8_0
```
**Other glm-4.7-flash variants** on GB300 (more GPU memory; bf16 is ~60 GB):
```bash
ollama pull glm-4.7-flash:q8_0
ollama pull glm-4.7-flash:bf16
ollama pull qwen3.6:27b
```
**Expected output** (example): Progress lines followed by "success" and the model in `ollama list`:
@ -134,22 +118,15 @@ ollama list
```text
NAME ID SIZE MODIFIED
glm-4.7-flash:latest abc123... 19 GB 1 minute ago
unsloth/GLM-4.7-GGUF:Q8_0 def456... ... ...
qwen3.6:27b abc123... ... 1 minute ago
```
## Step 4. Test local inference
**Description**: Run a quick prompt to confirm the model loads. Use the same model name you pulled (e.g. `glm-4.7-flash` for fast testing, or `hf.co/unsloth/GLM-4.7-GGUF:Q8_0` for best quality).
**Description**: Run a quick prompt to confirm the model loads.
```bash
ollama run glm-4.7-flash
```
Or, if you pulled the larger model:
```bash
ollama run hf.co/unsloth/GLM-4.7-GGUF:Q8_0
ollama run qwen3.6:27b
```
Try a prompt like:
@ -158,7 +135,7 @@ Try a prompt like:
Write a short README checklist for a Python project.
```
**Expected output**: GLM-4.7-Flash may show **Thinking...** and reasoning text before the final answer, then the model's response. This is normal; wait for the reply to complete.
**Expected output**: The model replies with a short README checklist.
**Exit the Ollama REPL** when done: type `/bye` or press **Ctrl+D**.
@ -167,7 +144,7 @@ Write a short README checklist for a Python project.
**Description**: Install the CLI tool that will drive the local model.
```bash
curl -fsSL https://claude.ai/install.sh | sh
curl -fsSL https://claude.ai/install.sh | bash
```
**Verify the installation**:
@ -184,10 +161,10 @@ claude --version
larger codebases, set it to 64K tokens. This increases memory usage.
For more details on configuring context length and other parameters, see the Ollama documentation (context window and runtime options).
Set the context length per session in the Ollama REPL (use the same model name you pulled, e.g. `glm-4.7-flash` or `hf.co/unsloth/GLM-4.7-GGUF:Q8_0`):
Set the context length per session in the Ollama REPL:
```bash
ollama run glm-4.7-flash
ollama run qwen3.6:27b
```
Then, in the Ollama prompt:
@ -210,33 +187,13 @@ Keep this terminal open and run the next step in a new terminal.
## Step 7. Connect Claude Code to Ollama
**Description**: Point Claude Code to the local Ollama server and launch it. Use the model you pulled: `glm-4.7-flash` (fast) or `hf.co/unsloth/GLM-4.7-GGUF:Q8_0` (best quality).
**Description**: Launch Claude Code through Ollama with the model you pulled.
```bash
export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_BASE_URL=http://localhost:11434
claude --model glm-4.7-flash
ollama launch claude --model qwen3.6:27b
```
If you are using the larger model:
```bash
claude --model hf.co/unsloth/GLM-4.7-GGUF:Q8_0
```
- **`ANTHROPIC_AUTH_TOKEN=ollama`**: Claude Code treats the literal value `ollama` as a special token that means "use the local Ollama backend" instead of Anthropic's cloud API. No real API key is needed when using Ollama.
- **`ANTHROPIC_BASE_URL`**: Tells Claude Code to send requests to your local Ollama server at port 11434.
**Persist these variables** (optional) so you don't have to re-export every terminal session. Add to `~/.bashrc` or your shell profile (e.g. `~/.zshrc`):
```bash
echo 'export ANTHROPIC_AUTH_TOKEN=ollama' >> ~/.bashrc
echo 'export ANTHROPIC_BASE_URL=http://localhost:11434' >> ~/.bashrc
source ~/.bashrc
```
**Expected output**: Claude Code starts and uses the local model.
**Expected output**: Claude Code starts and uses the local Ollama model.
**Exit Claude Code** when done: type `/exit` or press **Ctrl+C**.
@ -247,15 +204,18 @@ source ~/.bashrc
```bash
mkdir -p ~/cli-agent-demo
cd ~/cli-agent-demo
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -U pytest
printf 'def add(a, b):\n """Return the sum of a and b."""\n pass\n' > math_utils.py
printf 'import math_utils\n\n\ndef test_add():\n assert math_utils.add(1, 2) == 3\n' > test_math_utils.py
```
If you do not already have pytest installed:
If Claude Code is not already running, launch it:
```bash
python -m pip install -U pytest
ollama launch claude --model qwen3.6:27b
```
In Claude Code, enter:
@ -267,7 +227,8 @@ Please implement add() in math_utils.py and make sure the test passes.
**Exit Claude Code** when finished: type `/exit` or press **Ctrl+C**, then run the test:
```bash
python -m pytest -q
python3 -m pytest -q
deactivate
```
Expected output should show the test passing.
@ -282,17 +243,9 @@ Expected output should show the test passing.
**1. Remove the model** (Ollama must be running). Use the same name you pulled:
```bash
ollama rm glm-4.7-flash
ollama rm qwen3.6:27b
```
Or, for the Hugging Face model:
```bash
ollama rm hf.co/unsloth/GLM-4.7-GGUF:Q8_0
```
Use the exact tag you pulled (e.g. `glm-4.7-flash:bf16` if you used that variant).
**2. Stop the Ollama service**:
```bash
@ -301,8 +254,6 @@ sudo systemctl stop ollama
## Step 10. Next steps
- **Fast loading and testing:** use **glm-4.7-flash** for quick iteration and smaller downloads.
- **Best quality:** use **unsloth/GLM-4.7-GGUF:Q8_0** (Hugging Face on Ollama) or **glm-4.7-flash** high-quality variants (`glm-4.7-flash:bf16`, `glm-4.7-flash:q8_0`) on DGX Station (NVIDIA GB300).
- Use larger context (e.g. 64K198K) for big codebases.
- Use Claude Code on multi-file refactors or test-generation tasks.
@ -311,12 +262,16 @@ sudo systemctl stop ollama
| Symptom | Cause | Fix |
|---------|-------|-----|
| `ollama: command not found` | Ollama not installed or PATH not updated | Rerun `curl -fsSL https://ollama.com/install.sh | sh` and open a new shell |
| Model load fails with version error | Ollama is older than 0.15.0 | Update Ollama to 0.15.0 or newer (required for GLM-4.7-Flash). Do not pin to 0.14.3. |
| `model not found` in Claude Code | Model was not pulled | Run `ollama pull glm-4.7-flash` or `ollama pull hf.co/unsloth/GLM-4.7-GGUF:Q8_0` and retry. Use the same model name in `claude --model ...`. |
| Model load fails with version error | Ollama is older than the model requires | Update Ollama to a current stable release. Do not pin to older versions. |
| `model not found` in Claude Code | Model was not pulled | Run `ollama pull qwen3.6:27b` and retry with `ollama launch claude --model qwen3.6:27b`. |
| `connection refused` to localhost:11434 | Ollama service not running | Start with `ollama serve` or `sudo systemctl start ollama` |
| Slow responses or OOM | Insufficient GPU memory or fragmentation | On DGX Station (NVIDIA GB300), ensure no other heavy GPU workloads. If OOM persists, use a smaller variant (e.g. `glm-4.7-flash:q8_0` or `glm-4.7-flash:q4_K_M`) or `OLLAMA_MAX_LOADED_MODELS=1`. |
| Sharded GGUF model pull fails with HTTP 400 | Ollama does not support pulling sharded GGUF models from Hugging Face | Use the documented `qwen3.6:27b` model instead: `ollama pull qwen3.6:27b`. |
| `CUDA error: context is destroyed` on a dual-GPU Station | Ollama may fail when both the GB300 and RTX PRO 6000 GPUs are visible | Run Ollama with one visible GPU. For example, set `CUDA_VISIBLE_DEVICES=1` in the Ollama service environment, restart Ollama, and rerun the playbook. |
| Claude Code edit task fails through the direct Ollama endpoint | Direct endpoint wiring can fail with some Ollama/model combinations | Launch Claude Code through Ollama instead: `ollama launch claude --model qwen3.6:27b`. |
| `externally-managed-environment` or Python package install fails | System Python blocks direct package installs | Create and activate a virtual environment, then install pytest inside it: `python3 -m venv .venv`, `source .venv/bin/activate`, `python3 -m pip install -U pytest`. |
| Slow responses or OOM | Insufficient GPU memory or fragmentation | On DGX Station (NVIDIA GB300), ensure no other heavy GPU workloads. If OOM persists, unload other models or set `OLLAMA_MAX_LOADED_MODELS=1`. |
| `claude: command not found` after install | CLI not on PATH or install script did not complete | Restart the terminal or run `source ~/.bashrc` (or your shell profile). Check the install script output for the install path and add it to PATH. |
| Claude Code install fails (Node.js / network) | Node.js missing or install script cannot download | Ensure Node.js is installed (`node --version`). If the install script fails with a network error, retry from a stable connection or download the Claude Code CLI from the official site. See [Claude Code documentation](https://claude.ai/docs) for alternatives. |
| Claude Code install fails (Node.js / network) | Node.js missing or install script cannot download | Ensure Node.js is installed (`node --version`). Run the installer with Bash: `curl -fsSL https://claude.ai/install.sh | bash`. If the install script fails with a network error, retry from a stable connection or download the Claude Code CLI from the official site. See [Claude Code documentation](https://claude.ai/docs) for alternatives. |
> [!NOTE]
> DGX Station with **NVIDIA GB300** provides ample GPU memory for **glm-4.7-flash** (fast testing) and **unsloth/GLM-4.7-GGUF:Q8_0** (best quality), plus variants (e.g. `glm-4.7-flash:bf16`). Use `OLLAMA_MAX_LOADED_MODELS=1` if you hit memory limits with multiple models.
> DGX Station with **NVIDIA GB300** provides ample GPU memory for the documented `qwen3.6:27b` workflow. Use `OLLAMA_MAX_LOADED_MODELS=1` if you hit memory limits with multiple models.

View File

@ -2,7 +2,7 @@ kind: Playbook
metadata:
name: station-local-coding-agent
displayName: Local Coding Agent
shortDescription: Run local CLI coding agents with Ollama on DGX Station (GB300 Ultra) using GLM-4.7 and GLM-4.7-Flash
shortDescription: Run local CLI coding agents with Claude Code and Ollama on DGX Station (NVIDIA GB300) using qwen3.6:27b
publisher: nvidia
description: |
@ -17,8 +17,6 @@ metadata:
- LLM
- Ollama
- Claude Code
- OpenCode
- Codex
attributes:
- key: DURATION
@ -41,24 +39,18 @@ spec:
content: |
# Basic idea
Use Ollama on **DGX Station with GB300 Ultra** to run local coding models and connect a CLI coding agent. This
playbook supports three options: **Claude Code**, **OpenCode**, and **Codex CLI**. Each
agent talks to Ollama for local inference, so you can work without external cloud APIs.
Use Ollama on **DGX Station (NVIDIA GB300)** to run a local coding model and connect a CLI coding agent. This
playbook uses **Claude Code** with `ollama launch` so you can work without external cloud APIs.
The GB300 Ultras massive GPU memory lets you run **GLM-4.7** and **GLM-4.7-Flash** in high-quality variants (e.g. bf16, q8_0) for the best coding-assistant quality directly on the Station.
The DGX Station GPU (reported as **NVIDIA GB300** in `nvidia-smi`) provides ample memory to run **qwen3.6:27b** with Ollama for local coding-agent workflows.
# Choose your CLI agent
# CLI agent
Pick the tab that matches the CLI agent you want to use:
- **Claude Code**: Fastest path to a working CLI agent with a local Ollama model.
- **OpenCode**: Open-source CLI with provider configuration; this guide targets Ollama.
- **Codex CLI**: OpenAI Codex CLI configured to run against Ollama locally.
This playbook uses **Claude Code** as the CLI agent, connected to a local Ollama model for inference.
# What you'll accomplish
You will run a local coding model on your **DGX Station (GB300 Ultra)** with Ollama, connect it to your
chosen CLI agent, and complete a small coding task end-to-end. You can use **GLM-4.7** or **GLM-4.7-Flash** (including high-quality variants) to take full advantage of the Stations memory.
You will run **qwen3.6:27b** on your **DGX Station (NVIDIA GB300)** with Ollama, connect Claude Code to it, and complete a small coding task end-to-end.
# What to know before starting
@ -68,13 +60,11 @@ spec:
# Prerequisites
- **DGX Station** with **GB300 Ultra** (Grace Blackwell) and NVIDIA driver
- **DGX Station** with **NVIDIA GB300** (Grace Blackwell) and NVIDIA driver; `nvidia-smi` typically shows "NVIDIA GB300"
- Internet access to download model weights
- Ollama 0.14.3 or newer
- **GPU memory** on GB300 Ultra supports GLM-4.7 and high-quality variants:
- **GLM-4.7-Flash** (30B): ~19GB (latest) to ~60GB (bf16) — recommended default for coding
- **GLM-4.7** (full): use `ollama pull glm-4.7` for higher quality when available
- High-quality variants (e.g. `glm-4.7-flash:bf16`, `glm-4.7-flash:q8_0`) fit comfortably on GB300 Ultra
- **Ollama 0.15.0 or newer**
- **GPU memory** on GB300 supports the recommended `qwen3.6:27b` model
- **Disk space** for the `qwen3.6:27b` model download
# Time & risk
@ -83,8 +73,8 @@ spec:
* Large model downloads can fail if network connectivity is unstable
* Older Ollama versions will not load newer models
* **Rollback**: Stop Ollama and delete the downloaded model from `~/.ollama/models`
* **Last Updated:** February 2025
* Tailored for DGX Station with GB300 Ultra; added large-model recommendations
* **Last Updated:** 06/12/2026
* Model path set to qwen3.6:27b with `ollama launch`; Python task now uses a virtual environment
@ -101,51 +91,71 @@ spec:
nvidia-smi
```
Expected output should show a detected GPU (e.g. GB300 Ultra).
**Expected output** (example): A table showing driver version and GPU(s). On DGX Station, the GPU name may appear as **NVIDIA GB300** (without "Ultra"):
```text
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 5xx.xx Driver Version: 5xx.xx CUDA Version: 12.x |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| 0 NVIDIA GB300 On | 00000000:06:00.0 Off | 0 |
...
```
# Step 2. Install or update Ollama
**Description**: Install Ollama or ensure it is recent enough for modern coding models.
```bash
curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.14.3 sh
curl -fsSL https://ollama.com/install.sh | sh
ollama --version
```
If the ollama is already present and the version is 0.14.3 or newer, simply run:
To install a specific version if needed:
```bash
curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.15.0 sh
```
If Ollama is already present, simply run:
```bash
ollama --version
```
Expected output should show `ollama --version` as 0.14.3 or newer.
**Expected output** (example):
```text
ollama version is 0.15.0
```
# Step 3. Pull a coding model
**Description**: Download the model weights to your DGX Station. This playbook uses **GLM-4.7** where available.
**Description**: Download the model weights to your DGX Station.
**Recommended: GLM-4.7**:
This playbook uses **qwen3.6:27b** with Claude Code through Ollama:
```bash
ollama pull glm-4.7
ollama pull qwen3.6:27b
```
**High-quality variants** on GB300 Ultra (use more GPU memory for better quality):
**Expected output** (example): Progress lines followed by "success" and the model in `ollama list`:
```bash
ollama pull glm-4.7-flash:q8_0
ollama pull glm-4.7-flash:bf16
ollama list
```
Expected output should show your model in `ollama list`.
```text
NAME ID SIZE MODIFIED
qwen3.6:27b abc123... ... 1 minute ago
```
# Step 4. Test local inference
**Description**: Run a quick prompt to confirm the model loads. Use the same model name you pulled (e.g. `glm-4.7`).
**Description**: Run a quick prompt to confirm the model loads.
```bash
ollama run glm-4.7
ollama run qwen3.6:27b
```
Try a prompt like:
@ -154,26 +164,36 @@ spec:
Write a short README checklist for a Python project.
```
Expected output should show the model responding in the terminal.
**Expected output**: The model replies with a short README checklist.
**Exit the Ollama REPL** when done: type `/bye` or press **Ctrl+D**.
# Step 5. Install Claude Code
**Description**: Install the CLI tool that will drive the local model.
```bash
curl -fsSL https://claude.ai/install.sh | sh
curl -fsSL https://claude.ai/install.sh | bash
```
**Verify the installation**:
```bash
claude --version
```
**Expected output** (example): A version string such as `claude 0.x.x` or similar. If you see `claude: command not found`, ensure the install script added the CLI to your PATH (e.g. restart the terminal or source your shell profile); see [Troubleshooting](troubleshooting.md).
# Step 6. Increase context length (optional)
**Description**: Ollama defaults to a 4096 token context length. For coding agents and
larger codebases, set it to 64K tokens. This increases memory usage.
For more details on configuring context length, see the [Ollama documentation](https://ollama.com/docs/faq#how-can-i-increase-the-context-length).
For more details on configuring context length and other parameters, see the Ollama documentation (context window and runtime options).
Set the context length per session in the Ollama REPL:
```bash
ollama run glm-4.7
ollama run qwen3.6:27b
```
Then, in the Ollama prompt:
@ -183,6 +203,8 @@ spec:
```
**Exit when done**: type `/bye` or press **Ctrl+D**.
Optional method (set globally when serving Ollama):
```bash
@ -194,16 +216,15 @@ spec:
# Step 7. Connect Claude Code to Ollama
**Description**: Point Claude Code to the local Ollama server and launch it. Use the model you pulled (e.g. GLM-4.7 or GLM-4.7-Flash).
**Description**: Launch Claude Code through Ollama with the model you pulled.
```bash
export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_BASE_URL=http://localhost:11434
claude --model glm-4.7
ollama launch claude --model qwen3.6:27b
```
Expected output should show Claude Code starting and using the local model.
**Expected output**: Claude Code starts and uses the local Ollama model.
**Exit Claude Code** when done: type `/exit` or press **Ctrl+C**.
# Step 8. Complete a small coding task
@ -212,53 +233,58 @@ spec:
```bash
mkdir -p ~/cli-agent-demo
cd ~/cli-agent-demo
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -U pytest
printf 'def add(a, b):\n """Return the sum of a and b."""\n pass\n' > math_utils.py
printf 'import math_utils\n\n\ndef test_add():\n assert math_utils.add(1, 2) == 3\n' > test_math_utils.py
```
If you do not already have pytest installed:
If Claude Code is not already running, launch it:
```bash
python -m pip install -U pytest
ollama launch claude --model qwen3.6:27b
```
In Claude Code:
In Claude Code, enter:
```text
Please implement add() in math_utils.py and make sure the test passes.
```
Run the test:
**Exit Claude Code** when finished: type `/exit` or press **Ctrl+C**, then run the test:
```bash
python -m pytest -q
python3 -m pytest -q
deactivate
```
Expected output should show the test passing.
# Step 9. Cleanup and rollback
**Description**: Remove the model and stop services if you no longer need them.
**Description**: Remove the model and stop the Ollama service if you no longer need them. **Remove the model first** (while the Ollama server is running), then stop the service.
To stop the service:
> [!WARNING]
> The following removes the downloaded model files from disk.
**1. Remove the model** (Ollama must be running). Use the same name you pulled:
```bash
ollama rm qwen3.6:27b
```
**2. Stop the Ollama service**:
```bash
sudo systemctl stop ollama
```
> [!WARNING]
> This will delete the downloaded model files.
```bash
ollama rm glm-4.7
```
# Step 10. Next steps
- Use **GLM-4.7** or high-quality variants (`glm-4.7-flash:bf16`, `glm-4.7-flash:q8_0`) on GB300 Ultra for best quality
- Use larger context (e.g. 64K198K) for big codebases
- Use Claude Code on multi-file refactors or test-generation tasks
- Use larger context (e.g. 64K198K) for big codebases.
- Use Claude Code on multi-file refactors or test-generation tasks.
@ -270,18 +296,19 @@ spec:
| Symptom | Cause | Fix |
|---------|-------|-----|
| `ollama: command not found` | Ollama not installed or PATH not updated | Rerun `curl -fsSL https://ollama.com/install.sh | sh` and open a new shell |
| Model load fails with version error | Ollama is older than 0.14.3 | Update Ollama to 0.14.3 or newer |
| `model not found` in Claude Code | Model was not pulled | Run `ollama pull glm-4.7-flash` or `ollama pull glm-4.7` and retry |
| `opencode: command not found` | OpenCode not installed or PATH not updated | Install OpenCode and open a new shell |
| OpenCode cannot reach Ollama | `baseURL` misconfigured or Ollama not running | Set `baseURL` to `http://localhost:11434/v1` and start Ollama |
| `codex: command not found` | Codex CLI not installed or PATH not updated | Install Codex CLI and open a new shell |
| Codex CLI uses the wrong model/provider | `~/.codex/config.toml` not pointing to Ollama | Set `model_provider = "ollama"` and `base_url = "http://localhost:11434/v1"` |
| `connection refused` to localhost:11434 | Ollama service not running | Start with `ollama serve` or `systemctl start ollama` |
| Slow responses or OOM | Insufficient GPU memory or fragmentation | On DGX Station GB300 Ultra, ensure no other heavy GPU workloads. If OOM persists, use a smaller variant (e.g. `glm-4.7-flash:q8_0` or `glm-4.7-flash:q4_K_M`) or `OLLAMA_MAX_LOADED_MODELS=1`. |
| Model load fails with version error | Ollama is older than the model requires | Update Ollama to a current stable release. Do not pin to older versions. |
| `model not found` in Claude Code | Model was not pulled | Run `ollama pull qwen3.6:27b` and retry with `ollama launch claude --model qwen3.6:27b`. |
| `connection refused` to localhost:11434 | Ollama service not running | Start with `ollama serve` or `sudo systemctl start ollama` |
| Sharded GGUF model pull fails with HTTP 400 | Ollama does not support pulling sharded GGUF models from Hugging Face | Use the documented `qwen3.6:27b` model instead: `ollama pull qwen3.6:27b`. |
| `CUDA error: context is destroyed` on a dual-GPU Station | Ollama may fail when both the GB300 and RTX PRO 6000 GPUs are visible | Run Ollama with one visible GPU. For example, set `CUDA_VISIBLE_DEVICES=1` in the Ollama service environment, restart Ollama, and rerun the playbook. |
| Claude Code edit task fails through the direct Ollama endpoint | Direct endpoint wiring can fail with some Ollama/model combinations | Launch Claude Code through Ollama instead: `ollama launch claude --model qwen3.6:27b`. |
| `externally-managed-environment` or Python package install fails | System Python blocks direct package installs | Create and activate a virtual environment, then install pytest inside it: `python3 -m venv .venv`, `source .venv/bin/activate`, `python3 -m pip install -U pytest`. |
| Slow responses or OOM | Insufficient GPU memory or fragmentation | On DGX Station (NVIDIA GB300), ensure no other heavy GPU workloads. If OOM persists, unload other models or set `OLLAMA_MAX_LOADED_MODELS=1`. |
| `claude: command not found` after install | CLI not on PATH or install script did not complete | Restart the terminal or run `source ~/.bashrc` (or your shell profile). Check the install script output for the install path and add it to PATH. |
| Claude Code install fails (Node.js / network) | Node.js missing or install script cannot download | Ensure Node.js is installed (`node --version`). Run the installer with Bash: `curl -fsSL https://claude.ai/install.sh | bash`. If the install script fails with a network error, retry from a stable connection or download the Claude Code CLI from the official site. See [Claude Code documentation](https://claude.ai/docs) for alternatives. |
> [!NOTE]
> DGX Station with GB300 Ultra provides ample GPU memory for **GLM-4.7** and **GLM-4.7-Flash** in high-quality
> variants (e.g. `glm-4.7-flash:bf16`). Use `OLLAMA_MAX_LOADED_MODELS=1` if you hit memory limits with multiple models.
> DGX Station with **NVIDIA GB300** provides ample GPU memory for the documented `qwen3.6:27b` workflow. Use `OLLAMA_MAX_LOADED_MODELS=1` if you hit memory limits with multiple models.
@ -291,31 +318,11 @@ spec:
url: https://ollama.com/docs
- name: GLM-4.7-Flash (Ollama)
url: https://ollama.com/library/glm-4.7-flash
- name: GLM-4.7 (Ollama)
url: https://ollama.com/library/glm-4.7
- name: Qwen3.6 27B
url: https://ollama.com/library/qwen3.6
- name: Claude Code + Ollama Guide
url: https://ollama.com/blog/claude
- name: OpenCode Ollama Provider
url: https://opencode.ai/docs/providers/#ollama
- name: Codex + Ollama Guide
url: https://ollama.com/blog/codex
- name: DGX Station Documentation
url: https://docs.nvidia.com/dgx/dgx-station
- name: DGX Station Forum
url: https://forums.developer.nvidia.com/c/accelerated-computing/dgx-station

View File

@ -65,6 +65,8 @@ spec:
| Model | Quantization | Support Status | HF Handle |
|-------|-------------|----------------|-----------|
| **DiffusionGemma 26B A4B IT** | BF16 | ✅ | [`google/diffusiongemma-26B-A4B-it`](https://huggingface.co/google/diffusiongemma-26B-A4B-it) |
| **DiffusionGemma 26B A4B IT** | NVFP4 | ✅ | [`nvidia/diffusiongemma-26B-A4B-it-NVFP4`](https://huggingface.co/nvidia/diffusiongemma-26B-A4B-it-NVFP4) |
| **Step-3.7-Flash-FP8** | FP8 | ✅ | [`stepfun-ai/Step-3.7-Flash-FP8`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-FP8) |
| **Step-3.7-Flash-NVFP4** | NVFP4 | ✅ | [`stepfun-ai/Step-3.7-Flash-NVFP4`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-NVFP4) |
| **Qwen3-235B-A22B-NVFP4** | NVFP4 | ✅ | [`nvidia/Qwen3-235B-A22B-NVFP4`](https://huggingface.co/nvidia/Qwen3-235B-A22B-NVFP4) |
@ -74,7 +76,7 @@ spec:
* **Duration:** 30 minutes (longer on first run due to model download)
* **Risks:** Model download requires HuggingFace authentication
* **Rollback:** Stop and remove the container to restore state
* **Last Updated:** 05/28/2026
* **Last Updated:** 06/10/2026
* Update models
@ -117,6 +119,12 @@ spec:
docker pull nvcr.io/nvidia/vllm:26.01-py3
```
For DiffusionGemma, use the vLLM custom container:
```bash
docker pull vllm/vllm-openai:gemma
```
For Step-3.7-Flash models, pull the custom VLLM container
```bash
docker pull vllm/vllm-openai:stepfun37
@ -144,6 +152,34 @@ spec:
--gpu-memory-utilization 0.9
```
For DiffusionGemma models (e.g. `google/diffusiongemma-26B-A4B-it`), run with custom VLLM container.
```bash
docker run -d \
--name vllm-server \
-p 8000:8000 \
--gpus all \
--shm-size=16g \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-e VLLM_USE_V2_MODEL_RUNNER=1 \
-e HF_TOKEN="$HF_TOKEN" \
vllm/vllm-openai:gemma ${MODEL_HANDLE} \
--gpu-memory-utilization 0.85 \
--attention-backend TRITON_ATTN \
--max-num-seqs 16 \
--diffusion-config '{"canvas_length":256}' \
--override-generation-config '{"max_new_tokens": null}' \
--load-format fastsafetensors \
--enable-prefix-caching \
--reasoning-parser gemma4 \
--default-chat-template-kwargs '{"enable_thinking": true}' \
--enable-auto-tool-choice \
--tool-call-parser gemma4
# For BF16 checkpoint add "--moe-backend triton" for better performance
```
For Step-3.7-Flash models, run with the custom VLLM container. The FP8 and the NVFP4 versions fit entirely in VRAM on the GB300.
```bash

View File

@ -70,6 +70,8 @@ spec:
| **Step-3.7-Flash-FP8** | FP8 | ✅ | [`stepfun-ai/Step-3.7-Flash-FP8`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-FP8) |
| **Step-3.7-Flash-NVFP4** | NVFP4 | ✅ | [`stepfun-ai/Step-3.7-Flash-NVFP4`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-NVFP4) |
| **Qwen3-235B-A22B-NVFP4** | NVFP4 | ✅ | [`nvidia/Qwen3-235B-A22B-NVFP4`](https://huggingface.co/nvidia/Qwen3-235B-A22B-NVFP4) |
| **Kimi-K2.5 (1T)** | NVFP4 | ✅ | [`nvidia/Kimi-K2.5-NVFP4`](https://huggingface.co/nvidia/Kimi-K2.5-NVFP4) |
| **DeepSeek-V4-Flash** | NVFP4 | ✅ | [`deepseek-ai/DeepSeek-V4-Flash`](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash) |
# Time & risk
@ -78,6 +80,7 @@ spec:
* **Rollback:** Stop and remove the container to restore state
* **Last Updated:** 06/10/2026
* Update models
* Add base configuration example, per-setting explanations, and DeepSeek-V4-Flash recipe
@ -130,11 +133,23 @@ spec:
docker pull vllm/vllm-openai:stepfun37
```
For Kimi-K2.5 NVFP4 (1T) with DRAM offloading, pull the **26.03** image, which includes the `--cpu-offload-params` support used below:
```bash
docker pull nvcr.io/nvidia/vllm:26.03-py3
```
For DeepSeek-V4-Flash, pull the stable DeepSeek-V4 release container. Use the **cu130** build on DGX Station (Blackwell):
```bash
docker pull vllm/vllm-openai:v0.20.0-cu130
```
# Step 4. Start vLLM server
Start the vLLM server with the model. On a single-GPU DGX Station, `--gpus all` uses the GB300; if you have multiple GPUs and want to use only the GB300, replace with `--gpus '"device=N"'` where N is the GB300 device ID from `nvidia-smi`.
For Qwen3-235B NVFP4 model, run with the NGC container. This model fits entirely in VRAM on the GB300.
## Base configuration (most models)
This is the recommended starting point for any model that fits entirely in VRAM on the GB300. The Qwen3-235B-A22B-NVFP4 model, for example, runs directly with this configuration.
```bash
docker run -d \
@ -152,6 +167,12 @@ spec:
--gpu-memory-utilization 0.9
```
Settings used:
- `--max-model-len` — maximum context length (prompt + output) per request. Larger values reserve more GPU memory for the KV cache; size it to your workload.
- `--gpu-memory-utilization 0.9` — fraction of GPU memory vLLM may use for weights and KV cache. `0.9` leaves headroom for other processes; raise toward `0.95` to fit more KV cache if the GPU is dedicated.
## DiffusionGemma 26B A4B
For DiffusionGemma models (e.g. `google/diffusiongemma-26B-A4B-it`), run with custom VLLM container.
```bash
@ -180,6 +201,8 @@ spec:
# For BF16 checkpoint add "--moe-backend triton" for better performance
```
## Step-3.7-Flash (FP8 / NVFP4)
For Step-3.7-Flash models, run with the custom VLLM container. The FP8 and the NVFP4 versions fit entirely in VRAM on the GB300.
```bash
@ -202,6 +225,94 @@ spec:
--kv-cache-dtype fp8
```
Settings used (in addition to the base configuration):
- `--trust-remote-code` — allows the model's custom modeling code (shipped in its repo) to load. Required for Step-3.7.
- `--reasoning-parser step3p5` — parses the model's reasoning/thinking tokens into the dedicated `reasoning_content` response field.
- `--enable-auto-tool-choice` — lets the model decide when to call a tool, enabling OpenAI-compatible function calling.
- `--tool-call-parser step3p5` — parses the model's tool-call output into structured `tool_calls`. Pairs with `--enable-auto-tool-choice`.
- `--kv-cache-dtype fp8` — stores the KV cache in FP8, roughly halving KV-cache memory versus 16-bit and allowing more concurrent/longer sequences.
## Kimi-K2.5 NVFP4 (1T) — CPU offloading
For Kimi-K2.5 NVFP4 (1T) with DRAM offloading, run with the **26.03** NGC container. This model does not fit entirely in VRAM, so the MoE expert weights are offloaded to CPU DRAM with `--cpu-offload-gb 375 --cpu-offload-params experts`. Ensure the system has enough free DRAM to hold the offloaded weights.
```bash
docker run -d \
--name vllm-server \
--gpus all \
--ipc host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-p 8000:8000 \
-e HF_TOKEN="$HF_TOKEN" \
-v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
nvcr.io/nvidia/vllm:26.03-py3 \
vllm serve nvidia/Kimi-K2.5-NVFP4 \
--host 0.0.0.0 \
--port 8000 \
--dtype auto \
--kv-cache-dtype auto \
--gpu-memory-utilization 0.95 \
--served-model-name nvidia/Kimi-K2.5-NVFP4 \
--tensor-parallel-size 1 \
--no-enable-prefix-caching \
--trust-remote-code \
--max-model-len 40960 \
--max-num-seqs 1 \
--max-num-batched-tokens 32768 \
--cpu-offload-gb 375 \
--cpu-offload-params experts
```
Settings used (in addition to the base configuration):
- `--cpu-offload-gb 375` — amount of CPU DRAM (in GiB) vLLM may use to hold weights that don't fit in VRAM. Must be large enough for the offloaded experts; the system needs at least this much free DRAM.
- `--cpu-offload-params experts` — offloads only the MoE expert weights (the bulk of a large MoE model) to DRAM, keeping attention and other hot weights in VRAM.
- `--tensor-parallel-size 1` — single GPU; the GB300 serves the whole model.
- `--max-num-seqs 1` / `--max-num-batched-tokens 32768` — caps concurrency to one sequence and the batch token budget. With expert weights paged from DRAM, throughput is offload-bound, so a low concurrency keeps latency predictable.
- `--no-enable-prefix-caching` — disables prefix-cache reuse. Offloaded experts make the memory budget tight, so the cache is turned off here rather than spent on KV reuse.
- `--kv-cache-dtype auto` / `--dtype auto` — let vLLM pick the KV-cache and compute dtypes from the model's quantization (NVFP4).
## DeepSeek-V4-Flash — MTP + agentic
For DeepSeek-V4-Flash, run with the stable **v0.20.0-cu130** container. This recipe targets agentic workloads and enables Multi-Token Prediction (MTP) speculative decoding. On a single GB300 (TP1) the MoE expert-parallel path is used; the `deep_gemm_mega_moe` backend from some internal recipes is not needed at TP1 and is omitted here.
```bash
docker run -d \
--name vllm-server \
--gpus all \
--ipc host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-p 8000:8000 \
-e HF_TOKEN="$HF_TOKEN" \
-v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
vllm/vllm-openai:v0.20.0-cu130 \
deepseek-ai/DeepSeek-V4-Flash \
--enable-expert-parallel \
--kv-cache-dtype fp8 \
--trust-remote-code \
--block-size 256 \
--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' \
--attention_config.use_fp4_indexer_cache True \
--tokenizer-mode deepseek_v4 \
--tool-call-parser deepseek_v4 \
--enable-auto-tool-choice \
--reasoning-parser deepseek_v4 \
--speculative-config '{"method": "mtp", "num_speculative_tokens": 3}' \
--max-model-len 32768
```
Settings used (in addition to the base configuration):
- `--enable-expert-parallel` — shards the MoE experts across the available GPU(s) using expert parallelism, the recommended MoE execution path for DeepSeek-V4.
- `--speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'` — enables **MTP (Multi-Token Prediction)** speculative decoding: the model proposes 3 tokens per step that are verified in a single forward pass, cutting latency for accepted tokens.
- `--kv-cache-dtype fp8` — FP8 KV cache to fit more concurrent/longer sequences.
- `--block-size 256` — KV-cache page size in tokens. DeepSeek-V4 uses multiple KV-cache groups; `256` matches the recipe validated on Station.
- `--attention_config.use_fp4_indexer_cache True` — enables the FP4 indexer cache used by DeepSeek-V4's attention. (Drop this flag on platforms without native FP4, e.g. Hopper.)
- `--tokenizer-mode deepseek_v4` / `--tool-call-parser deepseek_v4` / `--reasoning-parser deepseek_v4` — DeepSeek-V4-specific tokenizer, tool-call, and reasoning parsers.
- `--enable-auto-tool-choice` — OpenAI-compatible function calling for agentic use.
- `--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'` — uses full + piecewise CUDA graph capture and enables all custom ops for lower per-step overhead.
- **Prefix caching is left enabled (the vLLM default).** For agentic workloads with large shared prefixes (e.g. a 32k system/context prefix) at low batch sizes (~BS 34), prefix caching gives a significant throughput boost by reusing the cached prefix across requests. Some internal recipes carry `--no-enable-prefix-caching`, but that was inherited from random-data benchmarking and is not recommended for agentic use here.
Check the server logs for startup progress:
```bash