Mirror of https://github.com/NVIDIA/dgx-spark-playbooks.git (synced 2026-06-24 23:29:31 +00:00)
chore: Regenerate all playbooks
Parent: 797933babb
Commit: 0c6aab8e63
@ -209,7 +209,7 @@ spec:
|
|||||||
df -h /
|
df -h /
|
||||||
```
|
```
|
||||||
|
|
||||||
Expected: Blackwell Ultra GPU, Docker >= 23.0.1, **Node.js v22.x** (the DGX Station ships with v18 — see below), OpenShell >= 0.0.33, and **at least 200 GB free** on `/` (86 GB model + Docker images + working space).
|
Expected: Blackwell Ultra GPU, Docker >= 23.0.1, **Node.js v22.x**, OpenShell >= 0.0.33, and **at least 200 GB free** on `/` (86 GB model + Docker images + working space).
|
||||||
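The 200 GB guideline can be checked numerically rather than read off `df -h`. The following is a minimal sketch, assuming GNU `df` (for `-BG` and `--output=avail`); `avail_gb_from_df` is a hypothetical helper name, not something the playbook ships, and `-BG` reports GiB, which is close enough for a threshold check.

```bash
# Sketch only: compare free space on / against the 200 GB guideline.
# avail_gb_from_df is a hypothetical helper, not part of the playbook.
avail_gb_from_df() {
  # stdin: output of `df -BG --output=avail <path>`; prints the integer value
  tail -n 1 | tr -dc '0-9'
  echo
}

required_gb=200
avail_gb="$(df -BG --output=avail / 2>/dev/null | avail_gb_from_df)"
if [ -n "$avail_gb" ] && [ "$avail_gb" -ge "$required_gb" ]; then
  echo "disk OK: ${avail_gb} GB free on /"
else
  echo "need at least ${required_gb} GB free on /, found: ${avail_gb:-unknown}"
fi
```

`make prereq` is described as covering the same check, so this is mainly useful when running the checks by hand before Step 2.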
|
|
||||||
> [!WARNING]
|
> [!WARNING]
|
||||||
> If `openshell --version` says `command not found`, the binary is at `~/.local/bin/openshell` but isn't on PATH. Run the `export PATH=...` line above and re-source `~/.bashrc`. Without this, every `openshell` and `make` command in later steps fails.
|
> If `openshell --version` says `command not found`, the binary is at `~/.local/bin/openshell` but isn't on PATH. Run the `export PATH=...` line above and re-source `~/.bashrc`. Without this, every `openshell` and `make` command in later steps fails.
|
||||||
@ -217,10 +217,14 @@ spec:
|
|||||||
> [!TIP]
|
> [!TIP]
|
||||||
> `make prereq` (run from `~/clinical-intelligence` after Step 2) bundles all of the checks below — Docker, Node version, OpenShell, disk space, GPU, port 11434, and NGC auth — into one command.
|
> `make prereq` (run from `~/clinical-intelligence` after Step 2) bundles all of the checks below — Docker, Node version, OpenShell, disk space, GPU, port 11434, and NGC auth — into one command.
|
||||||
|
|
||||||
**If `node --version` reports v18.x or older**, install Node.js v22 before continuing:
|
**If `node --version` reports v18.x or older, or fails with `command not found`**, install Node.js v22 before continuing:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
curl -fsSL https://deb.nodesource.com/setup_22.x | sudo -E bash -
|
# Download the NodeSource setup script first, then run it with sudo.
|
||||||
|
# Running it inline with `| sudo bash` does not work — the sudo context
|
||||||
|
# needs to own the entire script execution.
|
||||||
|
curl -fsSL https://deb.nodesource.com/setup_22.x -o /tmp/nodesource_setup.sh
|
||||||
|
sudo bash /tmp/nodesource_setup.sh
|
||||||
sudo apt-get install -y nodejs
|
sudo apt-get install -y nodejs
|
||||||
node --version # should now show v22.x
|
node --version # should now show v22.x
|
||||||
```
|
```
|
||||||
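The decision above (install v22 when `node --version` reports an older release or the command is missing) can be sketched as a small check. This is an illustrative helper, not part of the playbook: `node_major` and `needs_node22` are hypothetical names, and treating every major version below 22 as "needs upgrade" is an assumption based on the stated expectation of v22.x.

```bash
# Sketch only: decide whether Node.js must be (re)installed.
# node_major and needs_node22 are hypothetical helper names.
node_major() {
  # "v22.3.0" -> "22"; empty input stays empty
  v="${1#v}"
  printf '%s\n' "${v%%.*}"
}

needs_node22() {
  major="$(node_major "$1")"
  # Missing or non-numeric version (e.g. command not found): upgrade needed.
  case "$major" in
    ''|*[!0-9]*) return 0 ;;
  esac
  # Assumption: anything below major version 22 should be upgraded.
  [ "$major" -lt 22 ]
}

current="$(node --version 2>/dev/null || true)"
if needs_node22 "$current"; then
  echo "Node.js v22 required; found: ${current:-none}"
else
  echo "Node.js OK: $current"
fi
```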
@ -239,10 +243,20 @@ spec:
|
|||||||
|
|
||||||
Expected: `port 11434 free`. If the line still shows a listener, something else (an old `ollama serve`, another container, etc.) owns the port — stop it, or change `OLLAMA_PORT` in `.env` (Step 2) to a free port such as `11435`. `make setup` sources `.env` and configures the sandbox provider against the override.
|
Expected: `port 11434 free`. If the line still shows a listener, something else (an old `ollama serve`, another container, etc.) owns the port — stop it, or change `OLLAMA_PORT` in `.env` (Step 2) to a free port such as `11435`. `make setup` sources `.env` and configures the sandbox provider against the override.
|
||||||
|
|
||||||
**Stale OpenShell gateway?** If you previously ran the NemoClaw playbook, an existing gateway will be silently reused under the new name. To start clean:
|
**Stale OpenShell gateway?** If you previously ran a playbook that started `openshell-gateway`, kill the process and remove the registration:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
openshell gateway destroy 2>/dev/null || true
|
pkill -f openshell-gateway 2>/dev/null || true
|
||||||
|
openshell gateway remove openshell 2>/dev/null || true
|
||||||
|
```
|
||||||
|
|
||||||
|
**Previously ran the NemoClaw playbook?** NemoClaw installs `openclaw-gateway.service` as a systemd user service that binds port 18789. If it is still running, `make setup` fails with "Port 18789 is already in use". Stop and disable it before proceeding — `make setup` will also do this automatically, but stopping it here avoids a confusing error:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
systemctl --user stop openclaw-gateway.service 2>/dev/null || true
|
||||||
|
systemctl --user disable openclaw-gateway.service 2>/dev/null || true
|
||||||
|
# Verify the port is free
|
||||||
|
ss -tlnp | grep 18789 || echo 'port 18789 free'
|
||||||
```
|
```
|
||||||
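A plain `grep 18789` over `ss -tlnp` output can also match unrelated fields, such as a process ID that happens to be 18789. A slightly stricter check, offered here as a sketch, matches the port only where it ends a local address. `port_listening` is a hypothetical helper name; the sketch assumes the usual `ss -tln` layout in which the local address appears as `ADDR:PORT` followed by whitespace.

```bash
# Sketch only: report whether something is listening on a TCP port.
# port_listening is a hypothetical helper, not part of the playbook.
port_listening() {
  # stdin: output of `ss -tln`; $1: port number
  grep -Eq ":${1}([[:space:]]|\$)"
}

if command -v ss >/dev/null 2>&1; then
  for port in 11434 18789; do
    if ss -tln | port_listening "$port"; then
      echo "port $port in use"
    else
      echo "port $port free"
    fi
  done
fi
```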
|
|
||||||
# Step 2. Copy the assets and configure
|
# Step 2. Copy the assets and configure
|
||||||
@ -325,8 +339,8 @@ spec:
|
|||||||
Expected:
|
Expected:
|
||||||
|
|
||||||
```
|
```
|
||||||
Ollama: ✓ healthy
|
Ollama (port 11434): ✓ healthy
|
||||||
OpenFold3: ✓ healthy
|
OpenFold3 (port 8000): ✓ healthy
|
||||||
```
|
```
|
||||||
|
|
||||||
OpenFold3 takes ~3 minutes to load model weights on startup. If it shows "down (may still be loading)", wait and check again.
|
OpenFold3 takes ~3 minutes to load model weights on startup. If it shows "down (may still be loading)", wait and check again.
|
||||||
@ -336,26 +350,30 @@ spec:
|
|||||||
|
|
||||||
# Step 4. Start the OpenShell gateway
|
# Step 4. Start the OpenShell gateway
|
||||||
|
|
||||||
The OpenShell gateway runs a lightweight k3s Kubernetes cluster inside Docker to manage sandboxes. On DGX Station, the kernel uses cgroup v2 with the systemd driver, but k3s defaults to cgroupfs. The flag below tells k3s to match the host:
|
OpenShell >= 0.0.40 ships `openshell-gateway`, a standalone server binary installed alongside the CLI. Start it with the Docker driver (no Kubernetes required), then register it with the CLI:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
OPENSHELL_K3S_ARGS='--kubelet-arg=cgroup-driver=systemd' openshell gateway start
|
# Start the gateway server in the background using the Docker compute driver.
|
||||||
|
# --disable-tls is safe for local-only use (loopback-bound).
|
||||||
|
nohup openshell-gateway \
|
||||||
|
--disable-tls \
|
||||||
|
--drivers docker \
|
||||||
|
--bind-address 127.0.0.1 \
|
||||||
|
--port 17670 \
|
||||||
|
> /tmp/openshell-gateway.log 2>&1 &
|
||||||
|
echo "Gateway PID: $!"
|
||||||
|
|
||||||
|
# Register the gateway with the CLI and set it as active.
|
||||||
|
openshell gateway add http://127.0.0.1:17670 --name openshell
|
||||||
```
|
```
|
||||||
|
|
||||||
Wait for the gateway's embedded k3s cluster to finish initializing (10–15 seconds after `gateway start` returns), then verify:
|
Verify the gateway is connected:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Wait until the gateway accepts connections, fail after 60s
|
|
||||||
for i in $(seq 1 30); do
|
|
||||||
if openshell status 2>/dev/null | grep -q "Connected"; then
|
|
||||||
echo "Gateway: Connected"; break
|
|
||||||
fi
|
|
||||||
sleep 2
|
|
||||||
done
|
|
||||||
openshell status
|
openshell status
|
||||||
```
|
```
|
||||||
|
|
||||||
Expected: `Status: Connected`. If the first `openshell status` immediately after `gateway start` reports `Connection reset by peer`, that is normal — k3s is still warming up. The loop above polls until it is ready.
|
Expected: `Status: Connected`. If not connected, check `/tmp/openshell-gateway.log` for errors. The gateway typically starts in under 1 second.
|
||||||
|
|
||||||
> [!NOTE]
|
> [!NOTE]
|
||||||
> Step 4 configures OpenShell infrastructure (gateway). Step 5 deploys the healthcare agent into this infrastructure.
|
> Step 4 configures OpenShell infrastructure (gateway). Step 5 deploys the healthcare agent into this infrastructure.
|
||||||
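The standalone gateway is described as starting in under a second, so a single `openshell status` is normally enough. For unattended scripts, a bounded retry can still be useful. The sketch below assumes `openshell status` prints a line containing `Connected` once the gateway is reachable, as the expected output suggests; `wait_until` is a hypothetical helper name.

```bash
# Sketch only: retry a command until it succeeds or the attempts run out.
# wait_until is a hypothetical helper, not part of OpenShell.
wait_until() {
  # usage: wait_until <tries> <delay-seconds> <command...>
  tries="$1"; delay="$2"; shift 2
  i=0
  while [ "$i" -lt "$tries" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

if command -v openshell >/dev/null 2>&1; then
  # Up to 30 attempts, 2 seconds apart (about 60 seconds in total).
  if wait_until 30 2 sh -c 'openshell status 2>/dev/null | grep -q "Connected"'; then
    echo "Gateway: Connected"
  else
    echo "Gateway not connected; see /tmp/openshell-gateway.log" >&2
  fi
fi
```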
@ -492,7 +510,8 @@ spec:
|
|||||||
```bash
|
```bash
|
||||||
openshell sandbox delete clinical-sandbox
|
openshell sandbox delete clinical-sandbox
|
||||||
make down
|
make down
|
||||||
openshell gateway destroy
|
pkill -f openshell-gateway 2>/dev/null || true
|
||||||
|
openshell gateway remove openshell 2>/dev/null || true
|
||||||
```
|
```
|
||||||
|
|
||||||
To also remove downloaded models and volumes:
|
To also remove downloaded models and volumes:
|
||||||
|
|||||||
@ -1,6 +1,6 @@
|
|||||||
# Local Coding Agent
|
# Local Coding Agent
|
||||||
|
|
||||||
> Run local CLI coding agents with Ollama on DGX Station (NVIDIA GB300) using glm-4.7-flash (fast) or unsloth/GLM-4.7-GGUF:Q8_0 (best quality)
|
> Run local CLI coding agents with Claude Code and Ollama on DGX Station (NVIDIA GB300) using qwen3.6:27b
|
||||||
|
|
||||||
|
|
||||||
## Table of Contents
|
## Table of Contents
|
||||||
@ -15,10 +15,10 @@
|
|||||||
|
|
||||||
## Basic idea
|
## Basic idea
|
||||||
|
|
||||||
Use Ollama on **DGX Station (NVIDIA GB300)** to run local coding models and connect a CLI coding agent. This
|
Use Ollama on **DGX Station (NVIDIA GB300)** to run a local coding model and connect a CLI coding agent. This
|
||||||
playbook uses **Claude Code** to talk to Ollama for local inference, so you can work without external cloud APIs.
|
playbook uses **Claude Code** with `ollama launch` so you can work without external cloud APIs.
|
||||||
|
|
||||||
The DGX Station GPU (reported as **NVIDIA GB300** in `nvidia-smi`) provides ample memory to run **glm-4.7-flash** (fast loading and testing) and larger models such as **unsloth/GLM-4.7-GGUF:Q8_0** (best quality), both supported on Ollama.
|
The DGX Station GPU (reported as **NVIDIA GB300** in `nvidia-smi`) provides ample memory to run **qwen3.6:27b** with Ollama for local coding-agent workflows.
|
||||||
|
|
||||||
## CLI agent
|
## CLI agent
|
||||||
|
|
||||||
@ -26,7 +26,7 @@ This playbook uses **Claude Code** as the CLI agent, connected to a local Ollama
|
|||||||
|
|
||||||
## What you'll accomplish
|
## What you'll accomplish
|
||||||
|
|
||||||
You will run a local coding model on your **DGX Station (NVIDIA GB300)** with Ollama, connect Claude Code to it, and complete a small coding task end-to-end. Use **glm-4.7-flash** (including high-quality variants) or **unsloth/GLM-4.7-GGUF:Q8_0** for best quality.
|
You will run **qwen3.6:27b** on your **DGX Station (NVIDIA GB300)** with Ollama, connect Claude Code to it, and complete a small coding task end-to-end.
|
||||||
|
|
||||||
## What to know before starting
|
## What to know before starting
|
||||||
|
|
||||||
@ -38,12 +38,9 @@ You will run a local coding model on your **DGX Station (NVIDIA GB300)** with Ol
|
|||||||
|
|
||||||
- **DGX Station** with **NVIDIA GB300** (Grace Blackwell) and NVIDIA driver; `nvidia-smi` typically shows "NVIDIA GB300"
|
- **DGX Station** with **NVIDIA GB300** (Grace Blackwell) and NVIDIA driver; `nvidia-smi` typically shows "NVIDIA GB300"
|
||||||
- Internet access to download model weights
|
- Internet access to download model weights
|
||||||
- **Ollama 0.15.0 or newer** (required for GLM-4.7-Flash; do not pin to 0.14.3)
|
- **Ollama 0.15.0 or newer**
|
||||||
- **GPU memory** on GB300 supports both recommended models:
|
- **GPU memory** on GB300 supports the recommended `qwen3.6:27b` model
|
||||||
- **glm-4.7-flash**: ~19 GB (`latest`) to ~60 GB (bf16) — **recommended for fast loading and testing**
|
- **Disk space** for the `qwen3.6:27b` model download
|
||||||
- **unsloth/GLM-4.7-GGUF:Q8_0** (Hugging Face on Ollama): larger model — **recommended for best quality**
|
|
||||||
- Other variants (e.g. `glm-4.7-flash:bf16`, `glm-4.7-flash:q8_0`) fit on GB300
|
|
||||||
- **Disk space** for model downloads: plan for ~19 GB for `glm-4.7-flash:latest`, plus additional space for the Q8_0 or bf16 variants if you use them
|
|
||||||
|
|
||||||
## Time & risk
|
## Time & risk
|
||||||
|
|
||||||
@ -52,8 +49,8 @@ You will run a local coding model on your **DGX Station (NVIDIA GB300)** with Ol
|
|||||||
* Large model downloads can fail if network connectivity is unstable
|
* Large model downloads can fail if network connectivity is unstable
|
||||||
* Older Ollama versions will not load newer models
|
* Older Ollama versions will not load newer models
|
||||||
* **Rollback**: Stop Ollama and delete the downloaded model from `~/.ollama/models`
|
* **Rollback**: Stop Ollama and delete the downloaded model from `~/.ollama/models`
|
||||||
* **Last Updated:** 03/06/2026
|
* **Last Updated:** 06/12/2026
|
||||||
* Model set to glm-4.7-flash; Ollama 0.15.0+; cleanup order and docs refresh
|
* Model path set to qwen3.6:27b with `ollama launch`; Python task now uses a virtual environment
|
||||||
|
|
||||||
## Claude Code
|
## Claude Code
|
||||||
|
|
||||||
@ -85,13 +82,13 @@ curl -fsSL https://ollama.com/install.sh | sh
|
|||||||
ollama --version
|
ollama --version
|
||||||
```
|
```
|
||||||
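Because the prerequisites call for Ollama 0.15.0 or newer, the version can be compared programmatically instead of by eye. This is a minimal sketch, assuming GNU `sort -V` is available and that `ollama --version` ends its output with the version number (as in `ollama version is 0.15.0`); `version_ge` is a hypothetical helper name.

```bash
# Sketch only: check that the installed Ollama meets the minimum version.
# version_ge is a hypothetical helper; it relies on GNU `sort -V`.
version_ge() {
  # true when $1 >= $2 under version ordering
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

min_version=0.15.0
if command -v ollama >/dev/null 2>&1; then
  # Assumption: the last whitespace-separated field is the version number.
  installed="$(ollama --version 2>/dev/null | awk 'NR == 1 { print $NF }')"
  if version_ge "$installed" "$min_version"; then
    echo "Ollama $installed is new enough"
  else
    echo "Ollama ${installed:-unknown} is older than $min_version; update it"
  fi
fi
```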
|
|
||||||
To install a specific version (e.g. 0.15.0 or newer, required for GLM-4.7-Flash):
|
To install a specific version if needed:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.15.0 sh
|
curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.15.0 sh
|
||||||
```
|
```
|
||||||
|
|
||||||
If Ollama is already present and the version is 0.15.0 or newer, simply run:
|
If Ollama is already present, simply run:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
ollama --version
|
ollama --version
|
||||||
@ -105,25 +102,12 @@ ollama version is 0.15.0
|
|||||||
|
|
||||||
## Step 3. Pull a coding model
|
## Step 3. Pull a coding model
|
||||||
|
|
||||||
**Description**: Download the model weights to your DGX Station. This playbook supports two model options on Ollama; choose one (or both) depending on whether you want **fast loading and testing** or **best quality**.
|
**Description**: Download the model weights to your DGX Station.
|
||||||
|
|
||||||
**For fast loading and testing** — **glm-4.7-flash** (~19 GB for `latest`; loads quickly; ensure Ollama 0.15.0+):
|
This playbook uses **qwen3.6:27b** with Claude Code through Ollama:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
ollama pull glm-4.7-flash
|
ollama pull qwen3.6:27b
|
||||||
```
|
|
||||||
|
|
||||||
**For best quality** — **unsloth/GLM-4.7-GGUF:Q8_0** from Hugging Face (larger, higher quality; supported on Ollama):
|
|
||||||
|
|
||||||
```bash
|
|
||||||
ollama pull hf.co/unsloth/GLM-4.7-GGUF:Q8_0
|
|
||||||
```
|
|
||||||
|
|
||||||
**Other glm-4.7-flash variants** on GB300 (more GPU memory; bf16 is ~60 GB):
|
|
||||||
|
|
||||||
```bash
|
|
||||||
ollama pull glm-4.7-flash:q8_0
|
|
||||||
ollama pull glm-4.7-flash:bf16
|
|
||||||
```
|
```
|
||||||
|
|
||||||
**Expected output** (example): Progress lines followed by "success" and the model in `ollama list`:
|
**Expected output** (example): Progress lines followed by "success" and the model in `ollama list`:
|
||||||
@ -134,22 +118,15 @@ ollama list
|
|||||||
|
|
||||||
```text
|
```text
|
||||||
NAME ID SIZE MODIFIED
|
NAME ID SIZE MODIFIED
|
||||||
glm-4.7-flash:latest abc123... 19 GB 1 minute ago
|
qwen3.6:27b abc123... ... 1 minute ago
|
||||||
unsloth/GLM-4.7-GGUF:Q8_0 def456... ... ...
|
|
||||||
```
|
```
|
||||||
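To confirm the pull in a script rather than by reading the table, the `ollama list` output can be searched for an exact name match in the first column. The following is a sketch under that assumption about the column layout; `model_listed` is a hypothetical helper name.

```bash
# Sketch only: check whether a model tag appears in `ollama list` output.
# model_listed is a hypothetical helper, not an Ollama command.
model_listed() {
  # stdin: output of `ollama list`; $1: model name including tag
  awk -v want="$1" 'NR > 1 && $1 == want { found = 1 } END { exit !found }'
}

model=qwen3.6:27b
if command -v ollama >/dev/null 2>&1; then
  if ollama list 2>/dev/null | model_listed "$model"; then
    echo "$model is available locally"
  else
    echo "$model not found; run: ollama pull $model"
  fi
fi
```

Matching the whole first field avoids false positives from tags that merely share a prefix with the requested name.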
|
|
||||||
## Step 4. Test local inference
|
## Step 4. Test local inference
|
||||||
|
|
||||||
**Description**: Run a quick prompt to confirm the model loads. Use the same model name you pulled (e.g. `glm-4.7-flash` for fast testing, or `hf.co/unsloth/GLM-4.7-GGUF:Q8_0` for best quality).
|
**Description**: Run a quick prompt to confirm the model loads.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
ollama run glm-4.7-flash
|
ollama run qwen3.6:27b
|
||||||
```
|
|
||||||
|
|
||||||
Or, if you pulled the larger model:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
ollama run hf.co/unsloth/GLM-4.7-GGUF:Q8_0
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Try a prompt like:
|
Try a prompt like:
|
||||||
@ -158,7 +135,7 @@ Try a prompt like:
|
|||||||
Write a short README checklist for a Python project.
|
Write a short README checklist for a Python project.
|
||||||
```
|
```
|
||||||
|
|
||||||
**Expected output**: GLM-4.7-Flash may show **Thinking...** and reasoning text before the final answer, then the model's response. This is normal; wait for the reply to complete.
|
**Expected output**: The model replies with a short README checklist.
|
||||||
|
|
||||||
**Exit the Ollama REPL** when done: type `/bye` or press **Ctrl+D**.
|
**Exit the Ollama REPL** when done: type `/bye` or press **Ctrl+D**.
|
||||||
|
|
||||||
@ -167,7 +144,7 @@ Write a short README checklist for a Python project.
|
|||||||
**Description**: Install the CLI tool that will drive the local model.
|
**Description**: Install the CLI tool that will drive the local model.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
curl -fsSL https://claude.ai/install.sh | sh
|
curl -fsSL https://claude.ai/install.sh | bash
|
||||||
```
|
```
|
||||||
|
|
||||||
**Verify the installation**:
|
**Verify the installation**:
|
||||||
@ -184,10 +161,10 @@ claude --version
|
|||||||
larger codebases, set it to 64K tokens. This increases memory usage.
|
larger codebases, set it to 64K tokens. This increases memory usage.
|
||||||
For more details on configuring context length and other parameters, see the Ollama documentation (context window and runtime options).
|
For more details on configuring context length and other parameters, see the Ollama documentation (context window and runtime options).
|
||||||
|
|
||||||
Set the context length per session in the Ollama REPL (use the same model name you pulled, e.g. `glm-4.7-flash` or `hf.co/unsloth/GLM-4.7-GGUF:Q8_0`):
|
Set the context length per session in the Ollama REPL:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
ollama run glm-4.7-flash
|
ollama run qwen3.6:27b
|
||||||
```
|
```
|
||||||
|
|
||||||
Then, in the Ollama prompt:
|
Then, in the Ollama prompt:
|
||||||
@ -210,33 +187,13 @@ Keep this terminal open and run the next step in a new terminal.
|
|||||||
|
|
||||||
## Step 7. Connect Claude Code to Ollama
|
## Step 7. Connect Claude Code to Ollama
|
||||||
|
|
||||||
**Description**: Point Claude Code to the local Ollama server and launch it. Use the model you pulled: `glm-4.7-flash` (fast) or `hf.co/unsloth/GLM-4.7-GGUF:Q8_0` (best quality).
|
**Description**: Launch Claude Code through Ollama with the model you pulled.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
export ANTHROPIC_AUTH_TOKEN=ollama
|
ollama launch claude --model qwen3.6:27b
|
||||||
export ANTHROPIC_BASE_URL=http://localhost:11434
|
|
||||||
|
|
||||||
claude --model glm-4.7-flash
|
|
||||||
```
|
```
|
||||||
|
|
||||||
If you are using the larger model:
|
**Expected output**: Claude Code starts and uses the local Ollama model.
|
||||||
|
|
||||||
```bash
|
|
||||||
claude --model hf.co/unsloth/GLM-4.7-GGUF:Q8_0
|
|
||||||
```
|
|
||||||
|
|
||||||
- **`ANTHROPIC_AUTH_TOKEN=ollama`**: Claude Code treats the literal value `ollama` as a special token that means "use the local Ollama backend" instead of Anthropic's cloud API. No real API key is needed when using Ollama.
|
|
||||||
- **`ANTHROPIC_BASE_URL`**: Tells Claude Code to send requests to your local Ollama server at port 11434.
|
|
||||||
|
|
||||||
**Persist these variables** (optional) so you don't have to re-export every terminal session. Add to `~/.bashrc` or your shell profile (e.g. `~/.zshrc`):
|
|
||||||
|
|
||||||
```bash
|
|
||||||
echo 'export ANTHROPIC_AUTH_TOKEN=ollama' >> ~/.bashrc
|
|
||||||
echo 'export ANTHROPIC_BASE_URL=http://localhost:11434' >> ~/.bashrc
|
|
||||||
source ~/.bashrc
|
|
||||||
```
|
|
||||||
|
|
||||||
**Expected output**: Claude Code starts and uses the local model.
|
|
||||||
|
|
||||||
**Exit Claude Code** when done: type `/exit` or press **Ctrl+C**.
|
**Exit Claude Code** when done: type `/exit` or press **Ctrl+C**.
|
||||||
|
|
||||||
@ -247,15 +204,18 @@ source ~/.bashrc
|
|||||||
```bash
|
```bash
|
||||||
mkdir -p ~/cli-agent-demo
|
mkdir -p ~/cli-agent-demo
|
||||||
cd ~/cli-agent-demo
|
cd ~/cli-agent-demo
|
||||||
|
python3 -m venv .venv
|
||||||
|
source .venv/bin/activate
|
||||||
|
python3 -m pip install -U pytest
|
||||||
|
|
||||||
printf 'def add(a, b):\n """Return the sum of a and b."""\n pass\n' > math_utils.py
|
printf 'def add(a, b):\n """Return the sum of a and b."""\n pass\n' > math_utils.py
|
||||||
printf 'import math_utils\n\n\ndef test_add():\n assert math_utils.add(1, 2) == 3\n' > test_math_utils.py
|
printf 'import math_utils\n\n\ndef test_add():\n assert math_utils.add(1, 2) == 3\n' > test_math_utils.py
|
||||||
```
|
```
|
||||||
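Since the test run later depends on the virtual environment still being active, a quick guard can catch a forgotten `source .venv/bin/activate`. This is a minimal sketch that relies on the `VIRTUAL_ENV` variable, which the standard `venv` activate script exports; `in_venv` is a hypothetical helper name.

```bash
# Sketch only: detect whether a Python virtual environment is active.
# in_venv is a hypothetical helper; VIRTUAL_ENV is set by `activate`.
in_venv() {
  [ -n "${VIRTUAL_ENV:-}" ]
}

if in_venv; then
  echo "virtual environment active: $VIRTUAL_ENV"
else
  echo "no virtual environment active; run: source .venv/bin/activate"
fi
```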
|
|
||||||
If you do not already have pytest installed:
|
If Claude Code is not already running, launch it:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
python -m pip install -U pytest
|
ollama launch claude --model qwen3.6:27b
|
||||||
```
|
```
|
||||||
|
|
||||||
In Claude Code, enter:
|
In Claude Code, enter:
|
||||||
@ -267,7 +227,8 @@ Please implement add() in math_utils.py and make sure the test passes.
|
|||||||
**Exit Claude Code** when finished: type `/exit` or press **Ctrl+C**, then run the test:
|
**Exit Claude Code** when finished: type `/exit` or press **Ctrl+C**, then run the test:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
python -m pytest -q
|
python3 -m pytest -q
|
||||||
|
deactivate
|
||||||
```
|
```
|
||||||
|
|
||||||
Expected output should show the test passing.
|
Expected output should show the test passing.
|
||||||
@ -282,17 +243,9 @@ Expected output should show the test passing.
|
|||||||
**1. Remove the model** (Ollama must be running). Use the same name you pulled:
|
**1. Remove the model** (Ollama must be running). Use the same name you pulled:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
ollama rm glm-4.7-flash
|
ollama rm qwen3.6:27b
|
||||||
```
|
```
|
||||||
|
|
||||||
Or, for the Hugging Face model:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
ollama rm hf.co/unsloth/GLM-4.7-GGUF:Q8_0
|
|
||||||
```
|
|
||||||
|
|
||||||
Use the exact tag you pulled (e.g. `glm-4.7-flash:bf16` if you used that variant).
|
|
||||||
|
|
||||||
**2. Stop the Ollama service**:
|
**2. Stop the Ollama service**:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
@ -301,8 +254,6 @@ sudo systemctl stop ollama
|
|||||||
|
|
||||||
## Step 10. Next steps
|
## Step 10. Next steps
|
||||||
|
|
||||||
- **Fast loading and testing:** use **glm-4.7-flash** for quick iteration and smaller downloads.
|
|
||||||
- **Best quality:** use **unsloth/GLM-4.7-GGUF:Q8_0** (Hugging Face on Ollama) or **glm-4.7-flash** high-quality variants (`glm-4.7-flash:bf16`, `glm-4.7-flash:q8_0`) on DGX Station (NVIDIA GB300).
|
|
||||||
- Use larger context (e.g. 64K–198K) for big codebases.
|
- Use larger context (e.g. 64K–198K) for big codebases.
|
||||||
- Use Claude Code on multi-file refactors or test-generation tasks.
|
- Use Claude Code on multi-file refactors or test-generation tasks.
|
||||||
|
|
||||||
@ -311,12 +262,16 @@ sudo systemctl stop ollama
|
|||||||
| Symptom | Cause | Fix |
|
| Symptom | Cause | Fix |
|
||||||
|---------|-------|-----|
|
|---------|-------|-----|
|
||||||
| `ollama: command not found` | Ollama not installed or PATH not updated | Rerun `curl -fsSL https://ollama.com/install.sh \| sh` and open a new shell |
|
| `ollama: command not found` | Ollama not installed or PATH not updated | Rerun `curl -fsSL https://ollama.com/install.sh \| sh` and open a new shell |
|
||||||
| Model load fails with version error | Ollama is older than 0.15.0 | Update Ollama to 0.15.0 or newer (required for GLM-4.7-Flash). Do not pin to 0.14.3. |
|
| Model load fails with version error | Ollama is older than the model requires | Update Ollama to a current stable release. Do not pin to older versions. |
|
||||||
| `model not found` in Claude Code | Model was not pulled | Run `ollama pull glm-4.7-flash` or `ollama pull hf.co/unsloth/GLM-4.7-GGUF:Q8_0` and retry. Use the same model name in `claude --model ...`. |
|
| `model not found` in Claude Code | Model was not pulled | Run `ollama pull qwen3.6:27b` and retry with `ollama launch claude --model qwen3.6:27b`. |
|
||||||
| `connection refused` to localhost:11434 | Ollama service not running | Start with `ollama serve` or `sudo systemctl start ollama` |
|
| `connection refused` to localhost:11434 | Ollama service not running | Start with `ollama serve` or `sudo systemctl start ollama` |
|
||||||
| Slow responses or OOM | Insufficient GPU memory or fragmentation | On DGX Station (NVIDIA GB300), ensure no other heavy GPU workloads. If OOM persists, use a smaller variant (e.g. `glm-4.7-flash:q8_0` or `glm-4.7-flash:q4_K_M`) or `OLLAMA_MAX_LOADED_MODELS=1`. |
|
| Sharded GGUF model pull fails with HTTP 400 | Ollama does not support pulling sharded GGUF models from Hugging Face | Use the documented `qwen3.6:27b` model instead: `ollama pull qwen3.6:27b`. |
|
||||||
|
| `CUDA error: context is destroyed` on a dual-GPU Station | Ollama may fail when both the GB300 and RTX PRO 6000 GPUs are visible | Run Ollama with one visible GPU. For example, set `CUDA_VISIBLE_DEVICES=1` in the Ollama service environment, restart Ollama, and rerun the playbook. |
|
||||||
|
| Claude Code edit task fails through the direct Ollama endpoint | Direct endpoint wiring can fail with some Ollama/model combinations | Launch Claude Code through Ollama instead: `ollama launch claude --model qwen3.6:27b`. |
|
||||||
|
| `externally-managed-environment` or Python package install fails | System Python blocks direct package installs | Create and activate a virtual environment, then install pytest inside it: `python3 -m venv .venv`, `source .venv/bin/activate`, `python3 -m pip install -U pytest`. |
|
||||||
|
| Slow responses or OOM | Insufficient GPU memory or fragmentation | On DGX Station (NVIDIA GB300), ensure no other heavy GPU workloads. If OOM persists, unload other models or set `OLLAMA_MAX_LOADED_MODELS=1`. |
|
||||||
| `claude: command not found` after install | CLI not on PATH or install script did not complete | Restart the terminal or run `source ~/.bashrc` (or your shell profile). Check the install script output for the install path and add it to PATH. |
|
| `claude: command not found` after install | CLI not on PATH or install script did not complete | Restart the terminal or run `source ~/.bashrc` (or your shell profile). Check the install script output for the install path and add it to PATH. |
|
||||||
| Claude Code install fails (Node.js / network) | Node.js missing or install script cannot download | Ensure Node.js is installed (`node --version`). If the install script fails with a network error, retry from a stable connection or download the Claude Code CLI from the official site. See [Claude Code documentation](https://claude.ai/docs) for alternatives. |
|
| Claude Code install fails (Node.js / network) | Node.js missing or install script cannot download | Ensure Node.js is installed (`node --version`). Run the installer with Bash: `curl -fsSL https://claude.ai/install.sh \| bash`. If the install script fails with a network error, retry from a stable connection or download the Claude Code CLI from the official site. See [Claude Code documentation](https://claude.ai/docs) for alternatives. |
|
||||||
|
|
||||||
> [!NOTE]
|
> [!NOTE]
|
||||||
> DGX Station with **NVIDIA GB300** provides ample GPU memory for **glm-4.7-flash** (fast testing) and **unsloth/GLM-4.7-GGUF:Q8_0** (best quality), plus variants (e.g. `glm-4.7-flash:bf16`). Use `OLLAMA_MAX_LOADED_MODELS=1` if you hit memory limits with multiple models.
|
> DGX Station with **NVIDIA GB300** provides ample GPU memory for the documented `qwen3.6:27b` workflow. Use `OLLAMA_MAX_LOADED_MODELS=1` if you hit memory limits with multiple models.
|
||||||
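The troubleshooting table mentions setting `CUDA_VISIBLE_DEVICES` or `OLLAMA_MAX_LOADED_MODELS` in the Ollama service environment. One common way to do that on a systemd-managed install is a drop-in override file. The sketch below assumes the service is named `ollama` (as in `sudo systemctl stop ollama`) and uses the standard systemd drop-in directory; `write_gpu_override` is a hypothetical helper name, and the device index `1` is taken from the table's example, so confirm the right index with `nvidia-smi` first.

```bash
# Sketch only: write a systemd drop-in that limits Ollama to one GPU.
# write_gpu_override is a hypothetical helper, not part of Ollama.
write_gpu_override() {
  # $1: drop-in directory; $2: value for CUDA_VISIBLE_DEVICES
  dir="$1"; devices="$2"
  mkdir -p "$dir" &&
    printf '[Service]\nEnvironment="CUDA_VISIBLE_DEVICES=%s"\n' "$devices" \
      > "$dir/override.conf"
}

# Example (requires root; left commented so the sketch has no side effects):
#   sudo bash -c "$(declare -f write_gpu_override); \
#     write_gpu_override /etc/systemd/system/ollama.service.d 1"
#   sudo systemctl daemon-reload
#   sudo systemctl restart ollama
```

Running `sudo systemctl edit ollama` and typing the same two lines achieves the same result interactively.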
|
|||||||
@ -2,7 +2,7 @@ kind: Playbook
|
|||||||
metadata:
|
metadata:
|
||||||
name: station-local-coding-agent
|
name: station-local-coding-agent
|
||||||
displayName: Local Coding Agent
|
displayName: Local Coding Agent
|
||||||
shortDescription: Run local CLI coding agents with Ollama on DGX Station (GB300 Ultra) using GLM-4.7 and GLM-4.7-Flash
|
shortDescription: Run local CLI coding agents with Claude Code and Ollama on DGX Station (NVIDIA GB300) using qwen3.6:27b
|
||||||
|
|
||||||
publisher: nvidia
|
publisher: nvidia
|
||||||
description: |
|
description: |
|
||||||
@ -17,8 +17,6 @@ metadata:
|
|||||||
- LLM
|
- LLM
|
||||||
- Ollama
|
- Ollama
|
||||||
- Claude Code
|
- Claude Code
|
||||||
- OpenCode
|
|
||||||
- Codex
|
|
||||||
|
|
||||||
attributes:
|
attributes:
|
||||||
- key: DURATION
|
- key: DURATION
|
||||||
@ -41,24 +39,18 @@ spec:
|
|||||||
content: |
|
content: |
|
||||||
# Basic idea
|
# Basic idea
|
||||||
|
|
||||||
Use Ollama on **DGX Station with GB300 Ultra** to run local coding models and connect a CLI coding agent. This
|
Use Ollama on **DGX Station (NVIDIA GB300)** to run a local coding model and connect a CLI coding agent. This
|
||||||
playbook supports three options: **Claude Code**, **OpenCode**, and **Codex CLI**. Each
|
playbook uses **Claude Code** with `ollama launch` so you can work without external cloud APIs.
|
||||||
agent talks to Ollama for local inference, so you can work without external cloud APIs.
|
|
||||||
|
|
||||||
The GB300 Ultra’s massive GPU memory lets you run **GLM-4.7** and **GLM-4.7-Flash** in high-quality variants (e.g. bf16, q8_0) for the best coding-assistant quality directly on the Station.
|
The DGX Station GPU (reported as **NVIDIA GB300** in `nvidia-smi`) provides ample memory to run **qwen3.6:27b** with Ollama for local coding-agent workflows.
|
||||||
|
|
||||||
# Choose your CLI agent
|
# CLI agent
|
||||||
|
|
||||||
Pick the tab that matches the CLI agent you want to use:
|
This playbook uses **Claude Code** as the CLI agent, connected to a local Ollama model for inference.
|
||||||
|
|
||||||
- **Claude Code**: Fastest path to a working CLI agent with a local Ollama model.
|
|
||||||
- **OpenCode**: Open-source CLI with provider configuration; this guide targets Ollama.
|
|
||||||
- **Codex CLI**: OpenAI Codex CLI configured to run against Ollama locally.
|
|
||||||
|
|
||||||
# What you'll accomplish
|
# What you'll accomplish
|
||||||
|
|
||||||
You will run a local coding model on your **DGX Station (GB300 Ultra)** with Ollama, connect it to your
|
You will run **qwen3.6:27b** on your **DGX Station (NVIDIA GB300)** with Ollama, connect Claude Code to it, and complete a small coding task end-to-end.
|
||||||
chosen CLI agent, and complete a small coding task end-to-end. You can use **GLM-4.7** or **GLM-4.7-Flash** (including high-quality variants) to take full advantage of the Station’s memory.
|
|
||||||
|
|
||||||
# What to know before starting
|
# What to know before starting
|
||||||
|
|
||||||
@ -68,13 +60,11 @@ spec:
|
|||||||
|
|
||||||
# Prerequisites
|
# Prerequisites
|
||||||
|
|
||||||
- **DGX Station** with **GB300 Ultra** (Grace Blackwell) and NVIDIA driver
|
- **DGX Station** with **NVIDIA GB300** (Grace Blackwell) and NVIDIA driver; `nvidia-smi` typically shows "NVIDIA GB300"
|
||||||
- Internet access to download model weights
|
- Internet access to download model weights
|
||||||
- Ollama 0.14.3 or newer
|
- **Ollama 0.15.0 or newer**
|
||||||
- **GPU memory** on GB300 Ultra supports GLM-4.7 and high-quality variants:
|
- **GPU memory** on GB300 supports the recommended `qwen3.6:27b` model
|
||||||
- **GLM-4.7-Flash** (30B): ~19GB (latest) to ~60GB (bf16) — recommended default for coding
|
- **Disk space** for the `qwen3.6:27b` model download
|
||||||
- **GLM-4.7** (full): use `ollama pull glm-4.7` for higher quality when available
|
|
||||||
- High-quality variants (e.g. `glm-4.7-flash:bf16`, `glm-4.7-flash:q8_0`) fit comfortably on GB300 Ultra
|
|
||||||
|
|
||||||
# Time & risk
|
# Time & risk
|
||||||
|
|
||||||
@ -83,8 +73,8 @@ spec:
|
|||||||
* Large model downloads can fail if network connectivity is unstable
|
* Large model downloads can fail if network connectivity is unstable
|
||||||
* Older Ollama versions will not load newer models
|
* Older Ollama versions will not load newer models
|
||||||
* **Rollback**: Stop Ollama and delete the downloaded model from `~/.ollama/models`
|
* **Rollback**: Stop Ollama and delete the downloaded model from `~/.ollama/models`
|
||||||
* **Last Updated:** February 2025
|
* **Last Updated:** 06/12/2026
|
||||||
* Tailored for DGX Station with GB300 Ultra; added large-model recommendations
|
* Model path set to qwen3.6:27b with `ollama launch`; Python task now uses a virtual environment
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
@ -101,51 +91,71 @@ spec:
|
|||||||
nvidia-smi
|
nvidia-smi
|
||||||
```
|
```
|
||||||
|
|
||||||
Expected output should show a detected GPU (e.g. GB300 Ultra).
|
**Expected output** (example): A table showing driver version and GPU(s). On DGX Station, the GPU name may appear as **NVIDIA GB300** (without "Ultra"):
|
||||||
|
|
||||||
|
```text
|
||||||
|
+-----------------------------------------------------------------------------+
|
||||||
|
| NVIDIA-SMI 5xx.xx Driver Version: 5xx.xx CUDA Version: 12.x |
|
||||||
|
|-------------------------------+----------------------+----------------------+
|
||||||
|
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
|
||||||
|
| 0 NVIDIA GB300 On | 00000000:06:00.0 Off | 0 |
|
||||||
|
...
|
||||||
|
```
|
||||||
|
|
||||||
# Step 2. Install or update Ollama
|
# Step 2. Install or update Ollama
|
||||||
|
|
||||||
**Description**: Install Ollama or ensure it is recent enough for modern coding models.
|
**Description**: Install Ollama or ensure it is recent enough for modern coding models.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.14.3 sh
|
curl -fsSL https://ollama.com/install.sh | sh
|
||||||
ollama --version
|
ollama --version
|
||||||
```
|
```
|
||||||
|
|
||||||
If the ollama is already present and the version is 0.14.3 or newer, simply run:
|
To install a specific version if needed:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.15.0 sh
|
||||||
|
```
|
||||||
|
|
||||||
|
If Ollama is already present, simply run:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
ollama --version
|
ollama --version
|
||||||
```
|
```
|
||||||
|
|
||||||
Expected output should show `ollama --version` as 0.14.3 or newer.
|
**Expected output** (example):
|
||||||
|
|
||||||
|
```text
|
||||||
|
ollama version is 0.15.0
|
||||||
|
```
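The version requirement can also be checked in a script rather than by eye. The following is a minimal sketch, assuming GNU `sort -V` is available (it is on Ubuntu) and that `ollama --version` prints a line shaped like the example above; `version_ge` and `MIN_OLLAMA` are illustrative names, not part of Ollama.

```bash
#!/usr/bin/env bash
# Succeed when version $1 is greater than or equal to version $2.
# Uses GNU `sort -V` (version sort): if the smaller of the two is $2, then $1 >= $2.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Minimum version taken from the prerequisites in this playbook.
MIN_OLLAMA="0.15.0"

if command -v ollama >/dev/null 2>&1; then
  # `ollama --version` prints e.g. "ollama version is 0.15.0"; keep the last field.
  installed="$(ollama --version 2>/dev/null | awk '/version is/ { print $NF; exit }')"
  if version_ge "$installed" "$MIN_OLLAMA"; then
    echo "Ollama $installed is new enough"
  else
    echo "Ollama '$installed' is older than $MIN_OLLAMA (or could not be parsed); rerun the installer"
  fi
fi
```

If the check fails, rerunning the install command from this step upgrades Ollama in place.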
|
||||||
|
|
||||||
# Step 3. Pull a coding model
|
# Step 3. Pull a coding model
|
||||||
|
|
||||||
**Description**: Download the model weights to your DGX Station. This playbook uses **GLM-4.7** where available.
|
**Description**: Download the model weights to your DGX Station.
|
||||||
|
|
||||||
**Recommended: GLM-4.7**:
|
This playbook uses **qwen3.6:27b** with Claude Code through Ollama:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
ollama pull glm-4.7
|
ollama pull qwen3.6:27b
|
||||||
```
|
```
|
||||||
|
|
||||||
|
**Expected output** (example): Progress lines followed by "success" and the model in `ollama list`:
|
||||||
**High-quality variants** on GB300 Ultra (use more GPU memory for better quality):
|
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
ollama pull glm-4.7-flash:q8_0
|
ollama list
|
||||||
ollama pull glm-4.7-flash:bf16
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Expected output should show your model in `ollama list`.
|
```text
|
||||||
|
NAME ID SIZE MODIFIED
|
||||||
|
qwen3.6:27b abc123... ... 1 minute ago
|
||||||
|
```
|
||||||
|
|
||||||
# Step 4. Test local inference
|
# Step 4. Test local inference
|
||||||
|
|
||||||
**Description**: Run a quick prompt to confirm the model loads. Use the same model name you pulled (e.g. `glm-4.7`).
|
**Description**: Run a quick prompt to confirm the model loads.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
ollama run glm-4.7
|
ollama run qwen3.6:27b
|
||||||
```
|
```
|
||||||
|
|
||||||
Try a prompt like:
|
Try a prompt like:
|
||||||
@ -154,26 +164,36 @@ spec:
|
|||||||
Write a short README checklist for a Python project.
|
Write a short README checklist for a Python project.
|
||||||
```
|
```
|
||||||
|
|
||||||
Expected output should show the model responding in the terminal.
|
**Expected output**: The model replies with a short README checklist.
|
||||||
|
|
||||||
|
**Exit the Ollama REPL** when done: type `/bye` or press **Ctrl+D**.
|
||||||
|
|
||||||
# Step 5. Install Claude Code
|
# Step 5. Install Claude Code
|
||||||
|
|
||||||
**Description**: Install the CLI tool that will drive the local model.
|
**Description**: Install the CLI tool that will drive the local model.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
curl -fsSL https://claude.ai/install.sh | sh
|
curl -fsSL https://claude.ai/install.sh | bash
|
||||||
```
|
```
|
||||||
|
|
||||||
|
**Verify the installation**:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
claude --version
|
||||||
|
```
|
||||||
|
|
||||||
|
**Expected output** (example): A version string such as `claude 0.x.x` or similar. If you see `claude: command not found`, ensure the install script added the CLI to your PATH (e.g. restart the terminal or source your shell profile); see [Troubleshooting](troubleshooting.md).
|
||||||
|
|
||||||
# Step 6. Increase context length (optional)
|
# Step 6. Increase context length (optional)
|
||||||
|
|
||||||
**Description**: Ollama defaults to a 4096 token context length. For coding agents and
|
**Description**: Ollama defaults to a 4096 token context length. For coding agents and
|
||||||
larger codebases, set it to 64K tokens. This increases memory usage.
|
larger codebases, set it to 64K tokens. This increases memory usage.
|
||||||
For more details on configuring context length, see the [Ollama documentation](https://ollama.com/docs/faq#how-can-i-increase-the-context-length).
|
For more details on configuring context length and other parameters, see the Ollama documentation (context window and runtime options).
|
||||||
|
|
||||||
Set the context length per session in the Ollama REPL:
|
Set the context length per session in the Ollama REPL:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
ollama run glm-4.7
|
ollama run qwen3.6:27b
|
||||||
```
|
```
|
||||||
|
|
||||||
Then, in the Ollama prompt:
|
Then, in the Ollama prompt:
|
||||||
@ -183,6 +203,8 @@ spec:
|
|||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
**Exit when done**: type `/bye` or press **Ctrl+D**.
|
||||||
|
|
||||||
Optional method (set globally when serving Ollama):
|
Optional method (set globally when serving Ollama):
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
@ -194,16 +216,15 @@ spec:
|
|||||||
|
|
||||||
# Step 7. Connect Claude Code to Ollama
|
# Step 7. Connect Claude Code to Ollama
|
||||||
|
|
||||||
**Description**: Point Claude Code to the local Ollama server and launch it. Use the model you pulled (e.g. GLM-4.7 or GLM-4.7-Flash).
|
**Description**: Launch Claude Code through Ollama with the model you pulled.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
export ANTHROPIC_AUTH_TOKEN=ollama
|
ollama launch claude --model qwen3.6:27b
|
||||||
export ANTHROPIC_BASE_URL=http://localhost:11434
|
|
||||||
|
|
||||||
claude --model glm-4.7
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Expected output should show Claude Code starting and using the local model.
|
**Expected output**: Claude Code starts and uses the local Ollama model.
|
||||||
|
|
||||||
|
**Exit Claude Code** when done: type `/exit` or press **Ctrl+C**.
|
||||||
|
|
||||||
# Step 8. Complete a small coding task
|
# Step 8. Complete a small coding task
|
||||||
|
|
||||||
@ -212,53 +233,58 @@ spec:
|
|||||||
```bash
|
```bash
|
||||||
mkdir -p ~/cli-agent-demo
|
mkdir -p ~/cli-agent-demo
|
||||||
cd ~/cli-agent-demo
|
cd ~/cli-agent-demo
|
||||||
|
python3 -m venv .venv
|
||||||
|
source .venv/bin/activate
|
||||||
|
python3 -m pip install -U pytest
|
||||||
|
|
||||||
printf 'def add(a, b):\n """Return the sum of a and b."""\n pass\n' > math_utils.py
|
printf 'def add(a, b):\n """Return the sum of a and b."""\n pass\n' > math_utils.py
|
||||||
printf 'import math_utils\n\n\ndef test_add():\n assert math_utils.add(1, 2) == 3\n' > test_math_utils.py
|
printf 'import math_utils\n\n\ndef test_add():\n assert math_utils.add(1, 2) == 3\n' > test_math_utils.py
|
||||||
```
|
```
|
||||||
|
|
||||||
If you do not already have pytest installed:
|
If Claude Code is not already running, launch it:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
python -m pip install -U pytest
|
ollama launch claude --model qwen3.6:27b
|
||||||
```
|
```
|
||||||
|
|
||||||
In Claude Code:
|
In Claude Code, enter:
|
||||||
|
|
||||||
```text
|
```text
|
||||||
Please implement add() in math_utils.py and make sure the test passes.
|
Please implement add() in math_utils.py and make sure the test passes.
|
||||||
```
|
```
|
||||||
|
|
||||||
Run the test:
|
**Exit Claude Code** when finished: type `/exit` or press **Ctrl+C**, then run the test:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
python -m pytest -q
|
python3 -m pytest -q
|
||||||
|
deactivate
|
||||||
```
|
```
|
||||||
|
|
||||||
Expected output should show the test passing.
|
Expected output should show the test passing.
|
||||||
|
|
||||||
# Step 9. Cleanup and rollback
|
# Step 9. Cleanup and rollback
|
||||||
|
|
||||||
**Description**: Remove the model and stop services if you no longer need them.
|
**Description**: Remove the model and stop the Ollama service if you no longer need them. **Remove the model first** (while the Ollama server is running), then stop the service.
|
||||||
|
|
||||||
To stop the service:
|
> [!WARNING]
|
||||||
|
> The following removes the downloaded model files from disk.
|
||||||
|
|
||||||
|
**1. Remove the model** (Ollama must be running). Use the same name you pulled:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ollama rm qwen3.6:27b
|
||||||
|
```
|
||||||
|
|
||||||
|
**2. Stop the Ollama service**:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
sudo systemctl stop ollama
|
sudo systemctl stop ollama
|
||||||
```
|
```
|
||||||
|
|
||||||
> [!WARNING]
|
|
||||||
> This will delete the downloaded model files.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
ollama rm glm-4.7
|
|
||||||
```
|
|
||||||
|
|
||||||
# Step 10. Next steps
|
# Step 10. Next steps
|
||||||
|
|
||||||
- Use **GLM-4.7** or high-quality variants (`glm-4.7-flash:bf16`, `glm-4.7-flash:q8_0`) on GB300 Ultra for best quality
|
- Use larger context (e.g. 64K–198K) for big codebases.
|
||||||
- Use larger context (e.g. 64K–198K) for big codebases
|
- Use Claude Code on multi-file refactors or test-generation tasks.
|
||||||
- Use Claude Code on multi-file refactors or test-generation tasks
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
@ -270,18 +296,19 @@ spec:
|
|||||||
| Symptom | Cause | Fix |
|
| Symptom | Cause | Fix |
|
||||||
|---------|-------|-----|
|
|---------|-------|-----|
|
||||||
| `ollama: command not found` | Ollama not installed or PATH not updated | Rerun `curl -fsSL https://ollama.com/install.sh | sh` and open a new shell |
|
| `ollama: command not found` | Ollama not installed or PATH not updated | Rerun `curl -fsSL https://ollama.com/install.sh | sh` and open a new shell |
|
||||||
| Model load fails with version error | Ollama is older than 0.14.3 | Update Ollama to 0.14.3 or newer |
|
| Model load fails with version error | Ollama is older than the model requires | Update Ollama to a current stable release. Do not pin to older versions. |
|
||||||
| `model not found` in Claude Code | Model was not pulled | Run `ollama pull glm-4.7-flash` or `ollama pull glm-4.7` and retry |
|
| `model not found` in Claude Code | Model was not pulled | Run `ollama pull qwen3.6:27b` and retry with `ollama launch claude --model qwen3.6:27b`. |
|
||||||
| `opencode: command not found` | OpenCode not installed or PATH not updated | Install OpenCode and open a new shell |
|
| `connection refused` to localhost:11434 | Ollama service not running | Start with `ollama serve` or `sudo systemctl start ollama` |
|
||||||
| OpenCode cannot reach Ollama | `baseURL` misconfigured or Ollama not running | Set `baseURL` to `http://localhost:11434/v1` and start Ollama |
|
| Sharded GGUF model pull fails with HTTP 400 | Ollama does not support pulling sharded GGUF models from Hugging Face | Use the documented `qwen3.6:27b` model instead: `ollama pull qwen3.6:27b`. |
|
||||||
| `codex: command not found` | Codex CLI not installed or PATH not updated | Install Codex CLI and open a new shell |
|
| `CUDA error: context is destroyed` on a dual-GPU Station | Ollama may fail when both the GB300 and RTX PRO 6000 GPUs are visible | Run Ollama with one visible GPU. For example, set `CUDA_VISIBLE_DEVICES=1` in the Ollama service environment, restart Ollama, and rerun the playbook. |
|
||||||
| Codex CLI uses the wrong model/provider | `~/.codex/config.toml` not pointing to Ollama | Set `model_provider = "ollama"` and `base_url = "http://localhost:11434/v1"` |
|
| Claude Code edit task fails through the direct Ollama endpoint | Direct endpoint wiring can fail with some Ollama/model combinations | Launch Claude Code through Ollama instead: `ollama launch claude --model qwen3.6:27b`. |
|
||||||
| `connection refused` to localhost:11434 | Ollama service not running | Start with `ollama serve` or `systemctl start ollama` |
|
| `externally-managed-environment` or Python package install fails | System Python blocks direct package installs | Create and activate a virtual environment, then install pytest inside it: `python3 -m venv .venv`, `source .venv/bin/activate`, `python3 -m pip install -U pytest`. |
|
||||||
| Slow responses or OOM | Insufficient GPU memory or fragmentation | On DGX Station GB300 Ultra, ensure no other heavy GPU workloads. If OOM persists, use a smaller variant (e.g. `glm-4.7-flash:q8_0` or `glm-4.7-flash:q4_K_M`) or `OLLAMA_MAX_LOADED_MODELS=1`. |
|
| Slow responses or OOM | Insufficient GPU memory or fragmentation | On DGX Station (NVIDIA GB300), ensure no other heavy GPU workloads. If OOM persists, unload other models or set `OLLAMA_MAX_LOADED_MODELS=1`. |
|
||||||
|
| `claude: command not found` after install | CLI not on PATH or install script did not complete | Restart the terminal or run `source ~/.bashrc` (or your shell profile). Check the install script output for the install path and add it to PATH. |
|
||||||
|
| Claude Code install fails (Node.js / network) | Node.js missing or install script cannot download | Ensure Node.js is installed (`node --version`). Run the installer with Bash: `curl -fsSL https://claude.ai/install.sh | bash`. If the install script fails with a network error, retry from a stable connection or download the Claude Code CLI from the official site. See [Claude Code documentation](https://claude.ai/docs) for alternatives. |
|
||||||
|
|
||||||
> [!NOTE]
|
> [!NOTE]
|
||||||
> DGX Station with GB300 Ultra provides ample GPU memory for **GLM-4.7** and **GLM-4.7-Flash** in high-quality
|
> DGX Station with **NVIDIA GB300** provides ample GPU memory for the documented `qwen3.6:27b` workflow. Use `OLLAMA_MAX_LOADED_MODELS=1` if you hit memory limits with multiple models.
|
||||||
> variants (e.g. `glm-4.7-flash:bf16`). Use `OLLAMA_MAX_LOADED_MODELS=1` if you hit memory limits with multiple models.
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
@ -291,31 +318,11 @@ spec:
|
|||||||
url: https://ollama.com/docs
|
url: https://ollama.com/docs
|
||||||
|
|
||||||
|
|
||||||
- name: GLM-4.7-Flash (Ollama)
|
- name: Qwen3.6 27B
|
||||||
url: https://ollama.com/library/glm-4.7-flash
|
url: https://ollama.com/library/qwen3.6
|
||||||
|
|
||||||
|
|
||||||
- name: GLM-4.7 (Ollama)
|
|
||||||
url: https://ollama.com/library/glm-4.7
|
|
||||||
|
|
||||||
|
|
||||||
- name: Claude Code + Ollama Guide
|
- name: Claude Code + Ollama Guide
|
||||||
url: https://ollama.com/blog/claude
|
url: https://ollama.com/blog/claude
|
||||||
|
|
||||||
|
|
||||||
- name: OpenCode Ollama Provider
|
|
||||||
url: https://opencode.ai/docs/providers/#ollama
|
|
||||||
|
|
||||||
|
|
||||||
- name: Codex + Ollama Guide
|
|
||||||
url: https://ollama.com/blog/codex
|
|
||||||
|
|
||||||
|
|
||||||
- name: DGX Station Documentation
|
|
||||||
url: https://docs.nvidia.com/dgx/dgx-station
|
|
||||||
|
|
||||||
|
|
||||||
- name: DGX Station Forum
|
|
||||||
url: https://forums.developer.nvidia.com/c/accelerated-computing/dgx-station
|
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
@ -65,6 +65,8 @@ spec:
|
|||||||
|
|
||||||
| Model | Quantization | Support Status | HF Handle |
|
| Model | Quantization | Support Status | HF Handle |
|
||||||
|-------|-------------|----------------|-----------|
|
|-------|-------------|----------------|-----------|
|
||||||
|
| **DiffusionGemma 26B A4B IT** | BF16 | ✅ | [`google/diffusiongemma-26B-A4B-it`](https://huggingface.co/google/diffusiongemma-26B-A4B-it) |
|
||||||
|
| **DiffusionGemma 26B A4B IT** | NVFP4 | ✅ | [`nvidia/diffusiongemma-26B-A4B-it-NVFP4`](https://huggingface.co/nvidia/diffusiongemma-26B-A4B-it-NVFP4) |
|
||||||
| **Step-3.7-Flash-FP8** | FP8 | ✅ | [`stepfun-ai/Step-3.7-Flash-FP8`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-FP8) |
|
| **Step-3.7-Flash-FP8** | FP8 | ✅ | [`stepfun-ai/Step-3.7-Flash-FP8`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-FP8) |
|
||||||
| **Step-3.7-Flash-NVFP4** | NVFP4 | ✅ | [`stepfun-ai/Step-3.7-Flash-NVFP4`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-NVFP4) |
|
| **Step-3.7-Flash-NVFP4** | NVFP4 | ✅ | [`stepfun-ai/Step-3.7-Flash-NVFP4`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-NVFP4) |
|
||||||
| **Qwen3-235B-A22B-NVFP4** | NVFP4 | ✅ | [`nvidia/Qwen3-235B-A22B-NVFP4`](https://huggingface.co/nvidia/Qwen3-235B-A22B-NVFP4) |
|
| **Qwen3-235B-A22B-NVFP4** | NVFP4 | ✅ | [`nvidia/Qwen3-235B-A22B-NVFP4`](https://huggingface.co/nvidia/Qwen3-235B-A22B-NVFP4) |
|
||||||
@ -74,7 +76,7 @@ spec:
|
|||||||
* **Duration:** 30 minutes (longer on first run due to model download)
|
* **Duration:** 30 minutes (longer on first run due to model download)
|
||||||
* **Risks:** Model download requires HuggingFace authentication
|
* **Risks:** Model download requires HuggingFace authentication
|
||||||
* **Rollback:** Stop and remove the container to restore state
|
* **Rollback:** Stop and remove the container to restore state
|
||||||
* **Last Updated:** 05/28/2026
|
* **Last Updated:** 06/10/2026
|
||||||
* Update models
|
* Update models
|
||||||
|
|
||||||
|
|
||||||
@ -117,6 +119,12 @@ spec:
|
|||||||
docker pull nvcr.io/nvidia/vllm:26.01-py3
|
docker pull nvcr.io/nvidia/vllm:26.01-py3
|
||||||
```
|
```
|
||||||
|
|
||||||
|
For DiffusionGemma, use the vLLM custom container:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
docker pull vllm/vllm-openai:gemma
|
||||||
|
```
|
||||||
|
|
||||||
For Step-3.7-Flash models, pull the custom vLLM container:
|
For Step-3.7-Flash models, pull the custom vLLM container:
|
||||||
```bash
|
```bash
|
||||||
docker pull vllm/vllm-openai:stepfun37
|
docker pull vllm/vllm-openai:stepfun37
|
||||||
@ -144,6 +152,34 @@ spec:
|
|||||||
--gpu-memory-utilization 0.9
|
--gpu-memory-utilization 0.9
|
||||||
```
|
```
|
||||||
|
|
||||||
|
For DiffusionGemma models (e.g. `google/diffusiongemma-26B-A4B-it`), run with the custom vLLM container.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
docker run -d \
|
||||||
|
--name vllm-server \
|
||||||
|
-p 8000:8000 \
|
||||||
|
--gpus all \
|
||||||
|
--shm-size=16g \
|
||||||
|
--ulimit memlock=-1 \
|
||||||
|
--ulimit stack=67108864 \
|
||||||
|
-e VLLM_USE_V2_MODEL_RUNNER=1 \
|
||||||
|
-e HF_TOKEN="$HF_TOKEN" \
|
||||||
|
vllm/vllm-openai:gemma ${MODEL_HANDLE} \
|
||||||
|
--gpu-memory-utilization 0.85 \
|
||||||
|
--attention-backend TRITON_ATTN \
|
||||||
|
--max-num-seqs 16 \
|
||||||
|
--diffusion-config '{"canvas_length":256}' \
|
||||||
|
--override-generation-config '{"max_new_tokens": null}' \
|
||||||
|
--load-format fastsafetensors \
|
||||||
|
--enable-prefix-caching \
|
||||||
|
--reasoning-parser gemma4 \
|
||||||
|
--default-chat-template-kwargs '{"enable_thinking": true}' \
|
||||||
|
--enable-auto-tool-choice \
|
||||||
|
--tool-call-parser gemma4
|
||||||
|
|
||||||
|
# For BF16 checkpoint add "--moe-backend triton" for better performance
|
||||||
|
```
|
||||||
|
|
||||||
For Step-3.7-Flash models, run with the custom vLLM container. The FP8 and the NVFP4 versions fit entirely in VRAM on the GB300.
|
For Step-3.7-Flash models, run with the custom vLLM container. The FP8 and the NVFP4 versions fit entirely in VRAM on the GB300.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
|
|||||||
@ -70,6 +70,8 @@ spec:
|
|||||||
| **Step-3.7-Flash-FP8** | FP8 | ✅ | [`stepfun-ai/Step-3.7-Flash-FP8`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-FP8) |
|
| **Step-3.7-Flash-FP8** | FP8 | ✅ | [`stepfun-ai/Step-3.7-Flash-FP8`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-FP8) |
|
||||||
| **Step-3.7-Flash-NVFP4** | NVFP4 | ✅ | [`stepfun-ai/Step-3.7-Flash-NVFP4`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-NVFP4) |
|
| **Step-3.7-Flash-NVFP4** | NVFP4 | ✅ | [`stepfun-ai/Step-3.7-Flash-NVFP4`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-NVFP4) |
|
||||||
| **Qwen3-235B-A22B-NVFP4** | NVFP4 | ✅ | [`nvidia/Qwen3-235B-A22B-NVFP4`](https://huggingface.co/nvidia/Qwen3-235B-A22B-NVFP4) |
|
| **Qwen3-235B-A22B-NVFP4** | NVFP4 | ✅ | [`nvidia/Qwen3-235B-A22B-NVFP4`](https://huggingface.co/nvidia/Qwen3-235B-A22B-NVFP4) |
|
||||||
|
| **Kimi-K2.5 (1T)** | NVFP4 | ✅ | [`nvidia/Kimi-K2.5-NVFP4`](https://huggingface.co/nvidia/Kimi-K2.5-NVFP4) |
|
||||||
|
| **DeepSeek-V4-Flash** | NVFP4 | ✅ | [`deepseek-ai/DeepSeek-V4-Flash`](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash) |
|
||||||
|
|
||||||
# Time & risk
|
# Time & risk
|
||||||
|
|
||||||
@ -78,6 +80,7 @@ spec:
|
|||||||
* **Rollback:** Stop and remove the container to restore state
|
* **Rollback:** Stop and remove the container to restore state
|
||||||
* **Last Updated:** 06/10/2026
|
* **Last Updated:** 06/10/2026
|
||||||
* Update models
|
* Update models
|
||||||
|
* Add base configuration example, per-setting explanations, and DeepSeek-V4-Flash recipe
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
@ -130,11 +133,23 @@ spec:
|
|||||||
docker pull vllm/vllm-openai:stepfun37
|
docker pull vllm/vllm-openai:stepfun37
|
||||||
```
|
```
|
||||||
|
|
||||||
|
For Kimi-K2.5 NVFP4 (1T) with DRAM offloading, pull the **26.03** image, which includes the `--cpu-offload-params` support used below:
|
||||||
|
```bash
|
||||||
|
docker pull nvcr.io/nvidia/vllm:26.03-py3
|
||||||
|
```
|
||||||
|
|
||||||
|
For DeepSeek-V4-Flash, pull the stable DeepSeek-V4 release container. Use the **cu130** build on DGX Station (Blackwell):
|
||||||
|
```bash
|
||||||
|
docker pull vllm/vllm-openai:v0.20.0-cu130
|
||||||
|
```
|
||||||
|
|
||||||
# Step 4. Start vLLM server
|
# Step 4. Start vLLM server
|
||||||
|
|
||||||
Start the vLLM server with the model. On a single-GPU DGX Station, `--gpus all` uses the GB300; if you have multiple GPUs and want to use only the GB300, replace with `--gpus '"device=N"'` where N is the GB300 device ID from `nvidia-smi`.
|
Start the vLLM server with the model. On a single-GPU DGX Station, `--gpus all` uses the GB300; if you have multiple GPUs and want to use only the GB300, replace with `--gpus '"device=N"'` where N is the GB300 device ID from `nvidia-smi`.
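To find the value of N without reading the `nvidia-smi` table by hand, the device index can be extracted from its CSV query output. This is a sketch under the assumption that the GPU name contains the string `GB300`, as in the example output earlier; `gb300_index` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Print the index of the first GPU whose name contains "GB300".
# Reads lines shaped like "0, NVIDIA GB300" on stdin, which is the format of
# `nvidia-smi --query-gpu=index,name --format=csv,noheader`.
# Exits non-zero when no matching GPU is listed.
gb300_index() {
  awk -F', ' '$2 ~ /GB300/ { print $1; found = 1; exit } END { if (!found) exit 1 }'
}

if command -v nvidia-smi >/dev/null 2>&1; then
  if idx="$(nvidia-smi --query-gpu=index,name --format=csv,noheader | gb300_index)"; then
    echo "Pass this to docker run: --gpus '\"device=${idx}\"'"
  else
    echo "No GPU with GB300 in its name was found"
  fi
fi
```

On a single-GPU Station this should simply report index 0, in which case `--gpus all` is equivalent.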
|
||||||
|
|
||||||
For Qwen3-235B NVFP4 model, run with the NGC container. This model fits entirely in VRAM on the GB300.
|
## Base configuration (most models)
|
||||||
|
|
||||||
|
This is the recommended starting point for any model that fits entirely in VRAM on the GB300. The Qwen3-235B-A22B-NVFP4 model, for example, runs directly with this configuration.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
docker run -d \
|
docker run -d \
|
||||||
@ -152,6 +167,12 @@ spec:
|
|||||||
--gpu-memory-utilization 0.9
|
--gpu-memory-utilization 0.9
|
||||||
```
|
```
|
||||||
|
|
||||||
|
Settings used:
|
||||||
|
- `--max-model-len` — maximum context length (prompt + output) per request. Larger values reserve more GPU memory for the KV cache; size it to your workload.
|
||||||
|
- `--gpu-memory-utilization 0.9` — fraction of GPU memory vLLM may use for weights and KV cache. `0.9` leaves headroom for other processes; raise toward `0.95` to fit more KV cache if the GPU is dedicated.
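As a rough illustration of how the two settings interact, the memory left for the KV cache is approximately the GPU memory vLLM is allowed to use minus the model weights. The sketch below is a simplification that ignores activations, CUDA graphs, and other runtime overhead; the 100 GiB GPU and 60 GiB of weights are hypothetical round numbers, not GB300 or model specifications, and `kv_budget_gib` is an illustrative helper.

```bash
#!/usr/bin/env bash
# Rough upper bound on memory left for the KV cache, in GiB:
#   total_gib * utilization - weights_gib
# Arguments: total GPU memory (GiB), --gpu-memory-utilization value, weights size (GiB).
kv_budget_gib() {
  awk -v total="$1" -v util="$2" -v weights="$3" \
    'BEGIN { printf "%.1f\n", total * util - weights }'
}

# Hypothetical numbers, for illustration only.
kv_budget_gib 100 0.90 60   # prints 30.0
kv_budget_gib 100 0.95 60   # prints 35.0
```

In this toy case, raising utilization from `0.9` to `0.95` frees about 5 GiB more for the KV cache, which is the trade-off the setting description refers to.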
|
||||||
|
|
||||||
|
## DiffusionGemma 26B A4B
|
||||||
|
|
||||||
For DiffusionGemma models (e.g. `google/diffusiongemma-26B-A4B-it`), run with the custom vLLM container.
|
For DiffusionGemma models (e.g. `google/diffusiongemma-26B-A4B-it`), run with the custom vLLM container.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
@ -180,6 +201,8 @@ spec:
|
|||||||
# For BF16 checkpoint add "--moe-backend triton" for better performance
|
# For BF16 checkpoint add "--moe-backend triton" for better performance
|
||||||
```
|
```
|
||||||
|
|
||||||
|
## Step-3.7-Flash (FP8 / NVFP4)
|
||||||
|
|
||||||
For Step-3.7-Flash models, run with the custom vLLM container. The FP8 and the NVFP4 versions fit entirely in VRAM on the GB300.
|
For Step-3.7-Flash models, run with the custom vLLM container. The FP8 and the NVFP4 versions fit entirely in VRAM on the GB300.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
@ -202,6 +225,94 @@ spec:
|
|||||||
--kv-cache-dtype fp8
|
--kv-cache-dtype fp8
|
||||||
```
|
```
|
||||||
|
|
||||||
|
Settings used (in addition to the base configuration):
|
||||||
|
- `--trust-remote-code` — allows the model's custom modeling code (shipped in its repo) to load. Required for Step-3.7.
|
||||||
|
- `--reasoning-parser step3p5` — parses the model's reasoning/thinking tokens into the dedicated `reasoning_content` response field.
|
||||||
|
- `--enable-auto-tool-choice` — lets the model decide when to call a tool, enabling OpenAI-compatible function calling.
|
||||||
|
- `--tool-call-parser step3p5` — parses the model's tool-call output into structured `tool_calls`. Pairs with `--enable-auto-tool-choice`.
|
||||||
|
- `--kv-cache-dtype fp8` — stores the KV cache in FP8, roughly halving KV-cache memory versus 16-bit and allowing more concurrent/longer sequences.
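The "roughly halving" claim for the FP8 KV cache follows from the per-token cache size, which scales linearly with bytes per element. The sketch below assumes a standard attention layout that stores one key and one value vector per KV head per layer; the model shape (48 layers, 8 KV heads, head dimension 128) is hypothetical and not the Step-3.7 configuration, and any per-block scaling metadata is ignored.

```bash
#!/usr/bin/env bash
# KV-cache bytes per token for a standard attention layout:
#   2 (K and V) * layers * kv_heads * head_dim * bytes_per_element
# Arguments: layers, kv_heads, head_dim, bytes_per_element.
kv_bytes_per_token() {
  echo $(( 2 * $1 * $2 * $3 * $4 ))
}

# Hypothetical model shape, for illustration only.
kv_bytes_per_token 48 8 128 2   # 16-bit cache: prints 196608
kv_bytes_per_token 48 8 128 1   # FP8 cache:    prints 98304
```

With the same memory budget, halving the bytes per token doubles the number of tokens the cache can hold, which is what allows more concurrent or longer sequences.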
|
||||||
|
|
||||||
|
## Kimi-K2.5 NVFP4 (1T) — CPU offloading
|
||||||
|
|
||||||
|
For Kimi-K2.5 NVFP4 (1T) with DRAM offloading, run with the **26.03** NGC container. This model does not fit entirely in VRAM, so the MoE expert weights are offloaded to CPU DRAM with `--cpu-offload-gb 375 --cpu-offload-params experts`. Ensure the system has enough free DRAM to hold the offloaded weights.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
docker run -d \
|
||||||
|
--name vllm-server \
|
||||||
|
--gpus all \
|
||||||
|
--ipc host \
|
||||||
|
--ulimit memlock=-1 \
|
||||||
|
--ulimit stack=67108864 \
|
||||||
|
-p 8000:8000 \
|
||||||
|
-e HF_TOKEN="$HF_TOKEN" \
|
||||||
|
-v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
|
||||||
|
nvcr.io/nvidia/vllm:26.03-py3 \
|
||||||
|
vllm serve nvidia/Kimi-K2.5-NVFP4 \
|
||||||
|
--host 0.0.0.0 \
|
||||||
|
--port 8000 \
|
||||||
|
--dtype auto \
|
||||||
|
--kv-cache-dtype auto \
|
||||||
|
--gpu-memory-utilization 0.95 \
|
||||||
|
--served-model-name nvidia/Kimi-K2.5-NVFP4 \
|
||||||
|
--tensor-parallel-size 1 \
|
||||||
|
--no-enable-prefix-caching \
|
||||||
|
--trust-remote-code \
|
||||||
|
--max-model-len 40960 \
|
||||||
|
--max-num-seqs 1 \
|
||||||
|
--max-num-batched-tokens 32768 \
|
||||||
|
--cpu-offload-gb 375 \
|
||||||
|
--cpu-offload-params experts
|
||||||
|
```
|
||||||
|
|
||||||
|
Settings used (in addition to the base configuration):
|
||||||
|
- `--cpu-offload-gb 375` — amount of CPU DRAM (in GiB) vLLM may use to hold weights that don't fit in VRAM. Must be large enough for the offloaded experts; the system needs at least this much free DRAM.
|
||||||
|
- `--cpu-offload-params experts` — offloads only the MoE expert weights (the bulk of a large MoE model) to DRAM, keeping attention and other hot weights in VRAM.
|
||||||
|
- `--tensor-parallel-size 1` — single GPU; the GB300 serves the whole model.
|
||||||
|
- `--max-num-seqs 1` / `--max-num-batched-tokens 32768` — caps concurrency to one sequence and the batch token budget. With expert weights paged from DRAM, throughput is offload-bound, so a low concurrency keeps latency predictable.
|
||||||
|
- `--no-enable-prefix-caching` — disables prefix-cache reuse. Offloaded experts make the memory budget tight, so the cache is turned off here rather than spent on KV reuse.
|
||||||
|
- `--kv-cache-dtype auto` / `--dtype auto` — let vLLM pick the KV-cache and compute dtypes from the model's quantization (NVFP4).
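Because the recipe needs at least 375 GiB of free DRAM, it can help to check available memory before starting the container. This is a minimal sketch that reads `MemAvailable` from `/proc/meminfo` on Linux; `has_free_dram_gib` is a hypothetical helper, and `MemAvailable` is only the kernel's estimate, so treat the result as a sanity check rather than a guarantee.

```bash
#!/usr/bin/env bash
# Succeed when at least $1 GiB of memory is available.
# $2 optionally overrides the meminfo path (defaults to /proc/meminfo).
# MemAvailable is reported in KiB; 1048576 KiB = 1 GiB.
has_free_dram_gib() {
  local required_gib="$1" meminfo="${2:-/proc/meminfo}"
  awk -v need="$required_gib" '
    /^MemAvailable:/ { ok = ($2 / 1048576 >= need) }
    END { exit (ok ? 0 : 1) }
  ' "$meminfo"
}

if has_free_dram_gib 375; then
  echo "Enough available DRAM for --cpu-offload-gb 375"
else
  echo "Less than 375 GiB of DRAM available; free memory before starting vLLM"
fi
```

If the check fails, stop other memory-heavy processes before launching the server.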
|
||||||
|
|
||||||
|
## DeepSeek-V4-Flash — MTP + agentic
|
||||||
|
|
||||||
|
For DeepSeek-V4-Flash, run with the stable **v0.20.0-cu130** container. This recipe targets agentic workloads and enables Multi-Token Prediction (MTP) speculative decoding. On a single GB300 (TP1) the MoE expert-parallel path is used; the `deep_gemm_mega_moe` backend from some internal recipes is not needed at TP1 and is omitted here.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
docker run -d \
|
||||||
|
--name vllm-server \
|
||||||
|
--gpus all \
|
||||||
|
--ipc host \
|
||||||
|
--ulimit memlock=-1 \
|
||||||
|
--ulimit stack=67108864 \
|
||||||
|
-p 8000:8000 \
|
||||||
|
-e HF_TOKEN="$HF_TOKEN" \
|
||||||
|
-v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
|
||||||
|
vllm/vllm-openai:v0.20.0-cu130 \
|
||||||
|
deepseek-ai/DeepSeek-V4-Flash \
|
||||||
|
--enable-expert-parallel \
|
||||||
|
--kv-cache-dtype fp8 \
|
||||||
|
--trust-remote-code \
|
||||||
|
--block-size 256 \
|
||||||
|
--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' \
|
||||||
|
--attention_config.use_fp4_indexer_cache True \
|
||||||
|
--tokenizer-mode deepseek_v4 \
|
||||||
|
--tool-call-parser deepseek_v4 \
|
||||||
|
--enable-auto-tool-choice \
|
||||||
|
--reasoning-parser deepseek_v4 \
|
||||||
|
--speculative-config '{"method": "mtp", "num_speculative_tokens": 3}' \
|
||||||
|
--max-model-len 32768
|
||||||
|
```
|
||||||
|
|
||||||
|
Settings used (in addition to the base configuration):
|
||||||
|
- `--enable-expert-parallel` — shards the MoE experts across the available GPU(s) using expert parallelism, the recommended MoE execution path for DeepSeek-V4.
|
||||||
|
- `--speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'` — enables **MTP (Multi-Token Prediction)** speculative decoding: the model proposes 3 tokens per step that are verified in a single forward pass, cutting latency for accepted tokens.
|
||||||
|
- `--kv-cache-dtype fp8` — FP8 KV cache to fit more concurrent/longer sequences.
|
||||||
|
- `--block-size 256` — KV-cache page size in tokens. DeepSeek-V4 uses multiple KV-cache groups; `256` matches the recipe validated on Station.
|
||||||
|
- `--attention_config.use_fp4_indexer_cache True` — enables the FP4 indexer cache used by DeepSeek-V4's attention. (Drop this flag on platforms without native FP4, e.g. Hopper.)
|
||||||
|
- `--tokenizer-mode deepseek_v4` / `--tool-call-parser deepseek_v4` / `--reasoning-parser deepseek_v4` — DeepSeek-V4-specific tokenizer, tool-call, and reasoning parsers.
|
||||||
|
- `--enable-auto-tool-choice` — OpenAI-compatible function calling for agentic use.
|
||||||
|
- `--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'` — uses full + piecewise CUDA graph capture and enables all custom ops for lower per-step overhead.
|
||||||
|
- **Prefix caching is left enabled (the vLLM default).** For agentic workloads with large shared prefixes (e.g. a 32k system/context prefix) at low batch sizes (~BS 3–4), prefix caching gives a significant throughput boost by reusing the cached prefix across requests. Some internal recipes carry `--no-enable-prefix-caching`, but that was inherited from random-data benchmarking and is not recommended for agentic use here.
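To get a feel for what `num_speculative_tokens: 3` can buy, the expected number of tokens produced per verification pass can be estimated with the usual speculative-decoding model. The sketch assumes each drafted token is accepted independently with the same probability `a`, which is a simplification; real acceptance rates depend on the model and workload, and the values below are hypothetical. Under that assumption, with `k` drafted tokens the expectation is `(1 - a^(k+1)) / (1 - a)`.

```bash
#!/usr/bin/env bash
# Expected tokens emitted per verification pass with k drafted tokens,
# assuming each draft token is accepted independently with probability a:
#   (1 - a^(k+1)) / (1 - a), or k + 1 when a = 1.
# Arguments: acceptance probability a (0..1), number of speculative tokens k.
expected_tokens_per_step() {
  awk -v a="$1" -v k="$2" 'BEGIN {
    if (a >= 1) { printf "%.3f\n", k + 1; exit }
    printf "%.3f\n", (1 - a ^ (k + 1)) / (1 - a)
  }'
}

# Hypothetical acceptance rates with k = 3, as in the recipe.
expected_tokens_per_step 0.5 3   # prints 1.875
expected_tokens_per_step 0.8 3   # prints 2.952
```

The gain over plain decoding (1 token per pass) grows with the acceptance rate, so it is worth measuring acceptance on your own prompts before tuning `num_speculative_tokens`.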
|
||||||
|
|
||||||
Check the server logs for startup progress:
|
Check the server logs for startup progress:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
|
|||||||