chore: Regenerate all playbooks

2026-06-24 23:29:31 +00:00 · 2026-06-24 15:31:38 +00:00 · 2026-06-24 15:31:38 +00:00 · 0c6aab8e63
commit 0c6aab8e63
parent 797933babb
5 changed files with 332 additions and 204 deletions
--- a/nvidia/station-healthcare-agent/endpoint-production.yaml
+++ b/nvidia/station-healthcare-agent/endpoint-production.yaml
@ -209,7 +209,7 @@ spec:
        df -h /
        ```
-        Expected: Blackwell Ultra GPU, Docker >= 23.0.1, **Node.js v22.x** (the DGX Station ships with v18 — see below), OpenShell >= 0.0.33, and **at least 200 GB free** on `/` (86 GB model + Docker images + working space).
+        Expected: Blackwell Ultra GPU, Docker >= 23.0.1, **Node.js v22.x**, OpenShell >= 0.0.33, and **at least 200 GB free** on `/` (86 GB model + Docker images + working space).
        > [!WARNING]
        > If `openshell --version` says `command not found`, the binary is at `~/.local/bin/openshell` but isn't on PATH. Run the `export PATH=...` line above and re-source `~/.bashrc`. Without this, every `openshell` and `make` command in later steps fails.
@ -217,10 +217,14 @@ spec:
        > [!TIP]
        > `make prereq` (run from `~/clinical-intelligence` after Step 2) bundles all of the checks below — Docker, Node version, OpenShell, disk space, GPU, port 11434, and NGC auth — into one command.
-        **If `node --version` reports v18.x or older**, install Node.js v22 before continuing:
+        **If `node --version` reports v18.x or older, or fails with `command not found`**, install Node.js v22 before continuing:
        ```bash
-        curl -fsSL https://deb.nodesource.com/setup_22.x | sudo -E bash -
+        # Download the NodeSource setup script first, then run it with sudo.
        # Running it inline with `| sudo bash` does not work — the sudo context
        # needs to own the entire script execution.
        curl -fsSL https://deb.nodesource.com/setup_22.x -o /tmp/nodesource_setup.sh
        sudo bash /tmp/nodesource_setup.sh
        sudo apt-get install -y nodejs
        node --version   # should now show v22.x
        ```
@ -239,10 +243,20 @@ spec:
        Expected: `port 11434 free`. If the line still shows a listener, something else (an old `ollama serve`, another container, etc.) owns the port — stop it, or change `OLLAMA_PORT` in `.env` (Step 2) to a free port such as `11435`. `make setup` sources `.env` and configures the sandbox provider against the override.
-        **Stale OpenShell gateway?** If you previously ran the NemoClaw playbook, an existing gateway will be silently reused under the new name. To start clean:
+        **Stale OpenShell gateway?** If you previously ran a playbook that started `openshell-gateway`, kill the process and remove the registration:
        ```bash
-        openshell gateway destroy 2>/dev/null || true
+        pkill -f openshell-gateway 2>/dev/null || true
        openshell gateway remove openshell 2>/dev/null || true
        ```
        **Previously ran the NemoClaw playbook?** NemoClaw installs `openclaw-gateway.service` as a systemd user service that binds port 18789. If it is still running, `make setup` fails with "Port 18789 is already in use". Stop and disable it before proceeding — `make setup` will also do this automatically, but stopping it here avoids a confusing error:
        ```bash
        systemctl --user stop    openclaw-gateway.service 2>/dev/null || true
        systemctl --user disable openclaw-gateway.service 2>/dev/null || true
        # Verify the port is free
        ss -tlnp | grep 18789 || echo 'port 18789 free'
        ```
        # Step 2. Copy the assets and configure
@ -325,8 +339,8 @@ spec:
        Expected:
        ```
-          Ollama:    ✓ healthy
+          Ollama (port 11434):     ✓ healthy
-          OpenFold3: ✓ healthy
+          OpenFold3 (port 8000):  ✓ healthy
        ```
        OpenFold3 takes ~3 minutes to load model weights on startup. If it shows "down (may still be loading)", wait and check again.
@ -336,26 +350,30 @@ spec:
        # Step 4. Start the OpenShell gateway
-        The OpenShell gateway runs a lightweight k3s Kubernetes cluster inside Docker to manage sandboxes. On DGX Station, the kernel uses cgroup v2 with the systemd driver, but k3s defaults to cgroupfs. The flag below tells k3s to match the host:
+        OpenShell >= 0.0.40 ships `openshell-gateway`, a standalone server binary installed alongside the CLI. Start it with the Docker driver (no Kubernetes required), then register it with the CLI:
        ```bash
-        OPENSHELL_K3S_ARGS='--kubelet-arg=cgroup-driver=systemd' openshell gateway start
+        # Start the gateway server in the background using the Docker compute driver.
        # --disable-tls is safe for local-only use (loopback-bound).
        nohup openshell-gateway \
            --disable-tls \
            --drivers docker \
            --bind-address 127.0.0.1 \
            --port 17670 \
            > /tmp/openshell-gateway.log 2>&1 &
        echo "Gateway PID: $!"
        # Register the gateway with the CLI and set it as active.
        openshell gateway add http://127.0.0.1:17670 --name openshell
        ```
-        Wait for the gateway's embedded k3s cluster to finish initializing (10–15 seconds after `gateway start` returns), then verify:
+        Verify the gateway is connected:
        ```bash
        # Poll for up to 60s (30 tries, 2s apart) until the gateway reports Connected
        for i in $(seq 1 30); do
            if openshell status 2>/dev/null | grep -q "Connected"; then
                echo "Gateway: Connected"; break
            fi
            sleep 2
        done
        openshell status
        ```
-        Expected: `Status: Connected`. If the first `openshell status` immediately after `gateway start` reports `Connection reset by peer`, that is normal — k3s is still warming up. The loop above polls until it is ready.
+        Expected: `Status: Connected`. If not connected, check `/tmp/openshell-gateway.log` for errors. The gateway typically starts in under 1 second.
        > [!NOTE]
        > Step 4 configures OpenShell infrastructure (gateway). Step 5 deploys the healthcare agent into this infrastructure.
@ -492,7 +510,8 @@ spec:
        ```bash
        openshell sandbox delete clinical-sandbox
        make down
-        openshell gateway destroy
+        pkill -f openshell-gateway 2>/dev/null || true
        openshell gateway remove openshell 2>/dev/null || true
        ```
        To also remove downloaded models and volumes:
--- a/nvidia/station-local-coding-agent/README.md
+++ b/nvidia/station-local-coding-agent/README.md
@ -1,6 +1,6 @@
 # Local Coding Agent
-> Run local CLI coding agents with Ollama on DGX Station (NVIDIA GB300) using glm-4.7-flash (fast) or unsloth/GLM-4.7-GGUF:Q8_0 (best quality)
+> Run local CLI coding agents with Claude Code and Ollama on DGX Station (NVIDIA GB300) using qwen3.6:27b
 ## Table of Contents
@ -15,10 +15,10 @@
 ## Basic idea
-Use Ollama on **DGX Station (NVIDIA GB300)** to run local coding models and connect a CLI coding agent. This
+Use Ollama on **DGX Station (NVIDIA GB300)** to run a local coding model and connect a CLI coding agent. This
-playbook uses **Claude Code** to talk to Ollama for local inference, so you can work without external cloud APIs.
+playbook uses **Claude Code** with `ollama launch` so you can work without external cloud APIs.
-The DGX Station GPU (reported as **NVIDIA GB300** in `nvidia-smi`) provides ample memory to run **glm-4.7-flash** (fast loading and testing) and larger models such as **unsloth/GLM-4.7-GGUF:Q8_0** (best quality), both supported on Ollama.
+The DGX Station GPU (reported as **NVIDIA GB300** in `nvidia-smi`) provides ample memory to run **qwen3.6:27b** with Ollama for local coding-agent workflows.
 ## CLI agent
@ -26,7 +26,7 @@ This playbook uses **Claude Code** as the CLI agent, connected to a local Ollama
 ## What you'll accomplish
-You will run a local coding model on your **DGX Station (NVIDIA GB300)** with Ollama, connect Claude Code to it, and complete a small coding task end-to-end. Use **glm-4.7-flash** (including high-quality variants) or **unsloth/GLM-4.7-GGUF:Q8_0** for best quality.
+You will run **qwen3.6:27b** on your **DGX Station (NVIDIA GB300)** with Ollama, connect Claude Code to it, and complete a small coding task end-to-end.
 ## What to know before starting
@ -38,12 +38,9 @@ You will run a local coding model on your **DGX Station (NVIDIA GB300)** with Ol
 - **DGX Station** with **NVIDIA GB300** (Grace Blackwell) and NVIDIA driver; `nvidia-smi` typically shows "NVIDIA GB300"
 - Internet access to download model weights
- **Ollama 0.15.0 or newer** (required for GLM-4.7-Flash; do not pin to 0.14.3)
+- **Ollama 0.15.0 or newer**
- **GPU memory** on GB300 supports both recommended models:
+- **GPU memory** on GB300 supports the recommended `qwen3.6:27b` model
-  - **glm-4.7-flash**: ~19 GB (`latest`) to ~60 GB (bf16) — **recommended for fast loading and testing**
+- **Disk space** for the `qwen3.6:27b` model download
  - **unsloth/GLM-4.7-GGUF:Q8_0** (Hugging Face on Ollama): larger model — **recommended for best quality**
  - Other variants (e.g. `glm-4.7-flash:bf16`, `glm-4.7-flash:q8_0`) fit on GB300
 - **Disk space** for model downloads: plan for ~19 GB for `glm-4.7-flash:latest`, plus additional space for the Q8_0 or bf16 variants if you use them
 ## Time & risk
@ -52,8 +49,8 @@ You will run a local coding model on your **DGX Station (NVIDIA GB300)** with Ol
  * Large model downloads can fail if network connectivity is unstable
  * Older Ollama versions will not load newer models
 * **Rollback**: Stop Ollama and delete the downloaded model from `~/.ollama/models`
-* **Last Updated:** 03/06/2026
+* **Last Updated:** 06/12/2026
-  * Model set to glm-4.7-flash; Ollama 0.15.0+; cleanup order and docs refresh
+  * Model path set to qwen3.6:27b with `ollama launch`; Python task now uses a virtual environment
 ## Claude Code
@ -85,13 +82,13 @@ curl -fsSL https://ollama.com/install.sh | sh
 ollama --version
 ```
-To install a specific version (e.g. 0.15.0 or newer, required for GLM-4.7-Flash):
+To install a specific version if needed:
 ```bash
 curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.15.0 sh
 ```
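To check an installed Ollama against the 0.15.0 minimum from the prerequisites, a version comparison can be sketched as follows. The helper names `version_ge` and `parse_ollama_version` are hypothetical and not part of Ollama. The sketch assumes GNU `sort -V` is available and that `ollama --version` prints a line of the form `ollama version is X.Y.Z`, as in this playbook's expected output.

```bash
# Succeeds (exit 0) when version $1 is greater than or equal to version $2.
# Assumption: GNU `sort -V` (version sort) is available, as on typical Linux systems.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Prints the last whitespace-separated field of the first input line,
# e.g. "0.15.0" from "ollama version is 0.15.0".
parse_ollama_version() {
    awk '{ print $NF; exit }'
}

# Usage sketch (only meaningful where ollama is installed):
# installed="$(ollama --version | parse_ollama_version)"
# if version_ge "$installed" "0.15.0"; then echo "Ollama is new enough"; else echo "Update Ollama"; fi
```

Keeping the comparison in a function makes it easy to reuse for other minimum-version checks without hard-coding the threshold.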
-If Ollama is already present and the version is 0.15.0 or newer, simply run:
+If Ollama is already present, simply run:
 ```bash
 ollama --version
@ -105,25 +102,12 @@ ollama version is 0.15.0
 ## Step 3. Pull a coding model
-**Description**: Download the model weights to your DGX Station. This playbook supports two model options on Ollama; choose one (or both) depending on whether you want **fast loading and testing** or **best quality**.
+**Description**: Download the model weights to your DGX Station.
-**For fast loading and testing** — **glm-4.7-flash** (~19 GB for `latest`; loads quickly; ensure Ollama 0.15.0+):
+This playbook uses **qwen3.6:27b** with Claude Code through Ollama:
 ```bash
-ollama pull glm-4.7-flash
+ollama pull qwen3.6:27b
 ```
 **For best quality** — **unsloth/GLM-4.7-GGUF:Q8_0** from Hugging Face (larger, higher quality; supported on Ollama):
 ```bash
 ollama pull hf.co/unsloth/GLM-4.7-GGUF:Q8_0
 ```
 **Other glm-4.7-flash variants** on GB300 (more GPU memory; bf16 is ~60 GB):
 ```bash
 ollama pull glm-4.7-flash:q8_0
 ollama pull glm-4.7-flash:bf16
 ```
 **Expected output** (example): Progress lines followed by "success" and the model in `ollama list`:
@ -134,22 +118,15 @@ ollama list
 ```text
 NAME                                ID              SIZE    MODIFIED
-glm-4.7-flash:latest                abc123...       19 GB   1 minute ago
+qwen3.6:27b                         abc123...       ...     1 minute ago
 unsloth/GLM-4.7-GGUF:Q8_0           def456...       ...    ...
 ```
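To confirm the pull from a script rather than by eye, the `ollama list` table can be parsed. This is a minimal sketch, assuming the column layout shown above (one header row, model name in the first column). `model_listed` is a hypothetical helper name, not an Ollama command.

```bash
# Succeeds when model name $1 appears in the first column of an
# `ollama list` table read from stdin. The header row is skipped.
model_listed() {
    awk -v model="$1" 'NR > 1 && $1 == model { found = 1 } END { exit !found }'
}

# Usage sketch (requires a running Ollama server):
# ollama list | model_listed "qwen3.6:27b" && echo "model present"
```

Matching on the exact first column avoids false positives from partial names, so the full tag you pulled must be passed in.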
 ## Step 4. Test local inference
-**Description**: Run a quick prompt to confirm the model loads. Use the same model name you pulled (e.g. `glm-4.7-flash` for fast testing, or `hf.co/unsloth/GLM-4.7-GGUF:Q8_0` for best quality).
+**Description**: Run a quick prompt to confirm the model loads.
 ```bash
-ollama run glm-4.7-flash
+ollama run qwen3.6:27b
 ```
 Or, if you pulled the larger model:
 ```bash
 ollama run hf.co/unsloth/GLM-4.7-GGUF:Q8_0
 ```
 Try a prompt like:
@ -158,7 +135,7 @@ Try a prompt like:
 Write a short README checklist for a Python project.
 ```
-**Expected output**: GLM-4.7-Flash may show **Thinking...** and reasoning text before the final answer, then the model's response. This is normal; wait for the reply to complete.
+**Expected output**: The model replies with a short README checklist.
 **Exit the Ollama REPL** when done: type `/bye` or press **Ctrl+D**.
@ -167,7 +144,7 @@ Write a short README checklist for a Python project.
 **Description**: Install the CLI tool that will drive the local model.
 ```bash
-curl -fsSL https://claude.ai/install.sh | sh
+curl -fsSL https://claude.ai/install.sh | bash
 ```
 **Verify the installation**:
@ -184,10 +161,10 @@ claude --version
 larger codebases, set it to 64K tokens. This increases memory usage.
 For more details on configuring context length and other parameters, see the Ollama documentation (context window and runtime options).
-Set the context length per session in the Ollama REPL (use the same model name you pulled, e.g. `glm-4.7-flash` or `hf.co/unsloth/GLM-4.7-GGUF:Q8_0`):
+Set the context length per session in the Ollama REPL:
 ```bash
-ollama run glm-4.7-flash
+ollama run qwen3.6:27b
 ```
 Then, in the Ollama prompt:
@ -210,33 +187,13 @@ Keep this terminal open and run the next step in a new terminal.
 ## Step 7. Connect Claude Code to Ollama
-**Description**: Point Claude Code to the local Ollama server and launch it. Use the model you pulled: `glm-4.7-flash` (fast) or `hf.co/unsloth/GLM-4.7-GGUF:Q8_0` (best quality).
+**Description**: Launch Claude Code through Ollama with the model you pulled.
 ```bash
-export ANTHROPIC_AUTH_TOKEN=ollama
+ollama launch claude --model qwen3.6:27b
 export ANTHROPIC_BASE_URL=http://localhost:11434
 claude --model glm-4.7-flash
 ```
-If you are using the larger model:
+**Expected output**: Claude Code starts and uses the local Ollama model.
 ```bash
 claude --model hf.co/unsloth/GLM-4.7-GGUF:Q8_0
 ```
 - **`ANTHROPIC_AUTH_TOKEN=ollama`**: Claude Code treats the literal value `ollama` as a special token that means "use the local Ollama backend" instead of Anthropic's cloud API. No real API key is needed when using Ollama.
 - **`ANTHROPIC_BASE_URL`**: Tells Claude Code to send requests to your local Ollama server at port 11434.
 **Persist these variables** (optional) so you don't have to re-export every terminal session. Add to `~/.bashrc` or your shell profile (e.g. `~/.zshrc`):
 ```bash
 echo 'export ANTHROPIC_AUTH_TOKEN=ollama' >> ~/.bashrc
 echo 'export ANTHROPIC_BASE_URL=http://localhost:11434' >> ~/.bashrc
 source ~/.bashrc
 ```
 **Expected output**: Claude Code starts and uses the local model.
 **Exit Claude Code** when done: type `/exit` or press **Ctrl+C**.
@ -247,15 +204,18 @@ source ~/.bashrc
 ```bash
 mkdir -p ~/cli-agent-demo
 cd ~/cli-agent-demo
 python3 -m venv .venv
 source .venv/bin/activate
 python3 -m pip install -U pytest
 printf 'def add(a, b):\n    """Return the sum of a and b."""\n    pass\n' > math_utils.py
 printf 'import math_utils\n\n\ndef test_add():\n    assert math_utils.add(1, 2) == 3\n' > test_math_utils.py
 ```
-If you do not already have pytest installed:
+If Claude Code is not already running, launch it:
 ```bash
-python -m pip install -U pytest
+ollama launch claude --model qwen3.6:27b
 ```
 In Claude Code, enter:
@ -267,7 +227,8 @@ Please implement add() in math_utils.py and make sure the test passes.
 **Exit Claude Code** when finished: type `/exit` or press **Ctrl+C**, then run the test:
 ```bash
-python -m pytest -q
+python3 -m pytest -q
 deactivate
 ```
 Expected output should show the test passing.
@ -282,17 +243,9 @@ Expected output should show the test passing.
 **1. Remove the model** (Ollama must be running). Use the same name you pulled:
 ```bash
-ollama rm glm-4.7-flash
+ollama rm qwen3.6:27b
 ```
 Or, for the Hugging Face model:
 ```bash
 ollama rm hf.co/unsloth/GLM-4.7-GGUF:Q8_0
 ```
 Use the exact tag you pulled (e.g. `glm-4.7-flash:bf16` if you used that variant).
 **2. Stop the Ollama service**:
 ```bash
@ -301,8 +254,6 @@ sudo systemctl stop ollama
 ## Step 10. Next steps
 - **Fast loading and testing:** use **glm-4.7-flash** for quick iteration and smaller downloads.
 - **Best quality:** use **unsloth/GLM-4.7-GGUF:Q8_0** (Hugging Face on Ollama) or **glm-4.7-flash** high-quality variants (`glm-4.7-flash:bf16`, `glm-4.7-flash:q8_0`) on DGX Station (NVIDIA GB300).
 - Use larger context (e.g. 64K–198K) for big codebases.
 - Use Claude Code on multi-file refactors or test-generation tasks.
@ -311,12 +262,16 @@ sudo systemctl stop ollama
 | Symptom | Cause | Fix |
 |---------|-------|-----|
 | `ollama: command not found` | Ollama not installed or PATH not updated | Rerun `curl -fsSL https://ollama.com/install.sh \| sh` and open a new shell |
-| Model load fails with version error | Ollama is older than 0.15.0 | Update Ollama to 0.15.0 or newer (required for GLM-4.7-Flash). Do not pin to 0.14.3. |
+| Model load fails with version error | Ollama is older than the model requires | Update Ollama to a current stable release. Do not pin to older versions. |
-| `model not found` in Claude Code | Model was not pulled | Run `ollama pull glm-4.7-flash` or `ollama pull hf.co/unsloth/GLM-4.7-GGUF:Q8_0` and retry. Use the same model name in `claude --model ...`. |
+| `model not found` in Claude Code | Model was not pulled | Run `ollama pull qwen3.6:27b` and retry with `ollama launch claude --model qwen3.6:27b`. |
 | `connection refused` to localhost:11434 | Ollama service not running | Start with `ollama serve` or `sudo systemctl start ollama` |
-| Slow responses or OOM | Insufficient GPU memory or fragmentation | On DGX Station (NVIDIA GB300), ensure no other heavy GPU workloads. If OOM persists, use a smaller variant (e.g. `glm-4.7-flash:q8_0` or `glm-4.7-flash:q4_K_M`) or `OLLAMA_MAX_LOADED_MODELS=1`. |
+| Sharded GGUF model pull fails with HTTP 400 | Ollama does not support pulling sharded GGUF models from Hugging Face | Use the documented `qwen3.6:27b` model instead: `ollama pull qwen3.6:27b`. |
 | `CUDA error: context is destroyed` on a dual-GPU Station | Ollama may fail when both the GB300 and RTX PRO 6000 GPUs are visible | Run Ollama with one visible GPU. For example, set `CUDA_VISIBLE_DEVICES=1` in the Ollama service environment, restart Ollama, and rerun the playbook. |
 | Claude Code edit task fails through the direct Ollama endpoint | Direct endpoint wiring can fail with some Ollama/model combinations | Launch Claude Code through Ollama instead: `ollama launch claude --model qwen3.6:27b`. |
 | `externally-managed-environment` or Python package install fails | System Python blocks direct package installs | Create and activate a virtual environment, then install pytest inside it: `python3 -m venv .venv`, `source .venv/bin/activate`, `python3 -m pip install -U pytest`. |
 | Slow responses or OOM | Insufficient GPU memory or fragmentation | On DGX Station (NVIDIA GB300), ensure no other heavy GPU workloads. If OOM persists, unload other models or set `OLLAMA_MAX_LOADED_MODELS=1`. |
 | `claude: command not found` after install | CLI not on PATH or install script did not complete | Restart the terminal or run `source ~/.bashrc` (or your shell profile). Check the install script output for the install path and add it to PATH. |
-| Claude Code install fails (Node.js / network) | Node.js missing or install script cannot download | Ensure Node.js is installed (`node --version`). If the install script fails with a network error, retry from a stable connection or download the Claude Code CLI from the official site. See [Claude Code documentation](https://claude.ai/docs) for alternatives. |
+| Claude Code install fails (Node.js / network) | Node.js missing or install script cannot download | Ensure Node.js is installed (`node --version`). Run the installer with Bash: `curl -fsSL https://claude.ai/install.sh \| bash`. If the install script fails with a network error, retry from a stable connection or download the Claude Code CLI from the official site. See [Claude Code documentation](https://claude.ai/docs) for alternatives. |
 > [!NOTE]
-> DGX Station with **NVIDIA GB300** provides ample GPU memory for **glm-4.7-flash** (fast testing) and **unsloth/GLM-4.7-GGUF:Q8_0** (best quality), plus variants (e.g. `glm-4.7-flash:bf16`). Use `OLLAMA_MAX_LOADED_MODELS=1` if you hit memory limits with multiple models.
+> DGX Station with **NVIDIA GB300** provides ample GPU memory for the documented `qwen3.6:27b` workflow. Use `OLLAMA_MAX_LOADED_MODELS=1` if you hit memory limits with multiple models.
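For the dual-GPU row in the troubleshooting table, one way to set a variable in the Ollama service environment is a systemd drop-in file. This is a sketch under stated assumptions: the unit is named `ollama` (as in the `systemctl` commands in this playbook), the drop-in path follows the usual systemd layout, and `ollama_env_dropin` is a hypothetical helper that only prints the file content.

```bash
# Prints a systemd drop-in that sets one or more environment variables.
# Each argument must be a KEY=VALUE pair, for example CUDA_VISIBLE_DEVICES=1.
ollama_env_dropin() {
    local pair
    printf '[Service]\n'
    for pair in "$@"; do
        printf 'Environment="%s"\n' "$pair"
    done
}

# Usage sketch (assumed drop-in location; needs sudo, so it is left commented out):
# sudo mkdir -p /etc/systemd/system/ollama.service.d
# ollama_env_dropin CUDA_VISIBLE_DEVICES=1 | sudo tee /etc/systemd/system/ollama.service.d/override.conf
# sudo systemctl daemon-reload
# sudo systemctl restart ollama
```

The same helper accepts several pairs, so `OLLAMA_MAX_LOADED_MODELS=1` can be written to the same drop-in if memory limits are also a concern.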
--- a/nvidia/station-local-coding-agent/endpoint-test.yaml
+++ b/nvidia/station-local-coding-agent/endpoint-test.yaml
@ -2,7 +2,7 @@ kind: Playbook
 metadata:
  name: station-local-coding-agent
  displayName: Local Coding Agent
-  shortDescription: Run local CLI coding agents with Ollama on DGX Station (GB300 Ultra) using GLM-4.7 and GLM-4.7-Flash
+  shortDescription: Run local CLI coding agents with Claude Code and Ollama on DGX Station (NVIDIA GB300) using qwen3.6:27b
  publisher: nvidia
  description: |
@ -17,8 +17,6 @@ metadata:
  - LLM
  - Ollama
  - Claude Code
  - OpenCode
  - Codex
  attributes:
  - key: DURATION
@ -41,24 +39,18 @@ spec:
      content: |
        # Basic idea
-        Use Ollama on **DGX Station with GB300 Ultra** to run local coding models and connect a CLI coding agent. This
+        Use Ollama on **DGX Station (NVIDIA GB300)** to run a local coding model and connect a CLI coding agent. This
-        playbook supports three options: **Claude Code**, **OpenCode**, and **Codex CLI**. Each
+        playbook uses **Claude Code** with `ollama launch` so you can work without external cloud APIs.
        agent talks to Ollama for local inference, so you can work without external cloud APIs.
-        The GB300 Ultra’s massive GPU memory lets you run **GLM-4.7** and **GLM-4.7-Flash** in high-quality variants (e.g. bf16, q8_0) for the best coding-assistant quality directly on the Station.
+        The DGX Station GPU (reported as **NVIDIA GB300** in `nvidia-smi`) provides ample memory to run **qwen3.6:27b** with Ollama for local coding-agent workflows.
-        # Choose your CLI agent
+        # CLI agent
-        Pick the tab that matches the CLI agent you want to use:
+        This playbook uses **Claude Code** as the CLI agent, connected to a local Ollama model for inference.
        - **Claude Code**: Fastest path to a working CLI agent with a local Ollama model.
        - **OpenCode**: Open-source CLI with provider configuration; this guide targets Ollama.
        - **Codex CLI**: OpenAI Codex CLI configured to run against Ollama locally.
        # What you'll accomplish
-        You will run a local coding model on your **DGX Station (GB300 Ultra)** with Ollama, connect it to your
+        You will run **qwen3.6:27b** on your **DGX Station (NVIDIA GB300)** with Ollama, connect Claude Code to it, and complete a small coding task end-to-end.
        chosen CLI agent, and complete a small coding task end-to-end. You can use **GLM-4.7** or **GLM-4.7-Flash** (including high-quality variants) to take full advantage of the Station’s memory.
        # What to know before starting
@ -68,13 +60,11 @@ spec:
        # Prerequisites
-        - **DGX Station** with **GB300 Ultra** (Grace Blackwell) and NVIDIA driver
+        - **DGX Station** with **NVIDIA GB300** (Grace Blackwell) and NVIDIA driver; `nvidia-smi` typically shows "NVIDIA GB300"
        - Internet access to download model weights
-        - Ollama 0.14.3 or newer
+        - **Ollama 0.15.0 or newer**
-        - **GPU memory** on GB300 Ultra supports GLM-4.7 and high-quality variants:
+        - **GPU memory** on GB300 supports the recommended `qwen3.6:27b` model
-          - **GLM-4.7-Flash** (30B): ~19GB (latest) to ~60GB (bf16) — recommended default for coding
+        - **Disk space** for the `qwen3.6:27b` model download
          - **GLM-4.7** (full): use `ollama pull glm-4.7` for higher quality when available
          - High-quality variants (e.g. `glm-4.7-flash:bf16`, `glm-4.7-flash:q8_0`) fit comfortably on GB300 Ultra
        # Time & risk
@ -83,8 +73,8 @@ spec:
          * Large model downloads can fail if network connectivity is unstable
          * Older Ollama versions will not load newer models
        * **Rollback**: Stop Ollama and delete the downloaded model from `~/.ollama/models`
-        * **Last Updated:** February 2025
+        * **Last Updated:** 06/12/2026
-          * Tailored for DGX Station with GB300 Ultra; added large-model recommendations
+          * Model path set to qwen3.6:27b with `ollama launch`; Python task now uses a virtual environment
@ -101,51 +91,71 @@ spec:
        nvidia-smi
        ```
-        Expected output should show a detected GPU (e.g. GB300 Ultra).
+        **Expected output** (example): A table showing driver version and GPU(s). On DGX Station, the GPU name may appear as **NVIDIA GB300** (without "Ultra"):
        ```text
        +-----------------------------------------------------------------------------+
        | NVIDIA-SMI 5xx.xx    Driver Version: 5xx.xx    CUDA Version: 12.x          |
        |-------------------------------+----------------------+----------------------+
        | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
        |   0  NVIDIA GB300        On   | 00000000:06:00.0 Off |                    0 |
        ...
        ```
        # Step 2. Install or update Ollama
        **Description**: Install Ollama or ensure it is recent enough for modern coding models.
        ```bash
-        curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.14.3 sh
+        curl -fsSL https://ollama.com/install.sh | sh
        ollama --version
        ```
-        If the ollama is already present and the version is 0.14.3 or newer, simply run:
+        To install a specific version if needed:
        ```bash
        curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.15.0 sh
        ```
        If Ollama is already present, simply run:
        ```bash
        ollama --version
        ```
-        Expected output should show `ollama --version` as 0.14.3 or newer.
+        **Expected output** (example):
        ```text
        ollama version is 0.15.0
        ```
        # Step 3. Pull a coding model
-        **Description**: Download the model weights to your DGX Station. This playbook uses **GLM-4.7** where available.
+        **Description**: Download the model weights to your DGX Station.
-        **Recommended: GLM-4.7**:
+        This playbook uses **qwen3.6:27b** with Claude Code through Ollama:
        ```bash
-        ollama pull glm-4.7
+        ollama pull qwen3.6:27b
        ```
-        
+        **Expected output** (example): Progress lines followed by "success" and the model in `ollama list`:
        **High-quality variants** on GB300 Ultra (use more GPU memory for better quality):
        ```bash
-        ollama pull glm-4.7-flash:q8_0
+        ollama list
        ollama pull glm-4.7-flash:bf16
        ```
-        Expected output should show your model in `ollama list`.
+        ```text
        NAME                                ID              SIZE    MODIFIED
        qwen3.6:27b                         abc123...       ...     1 minute ago
        ```
        # Step 4. Test local inference
-        **Description**: Run a quick prompt to confirm the model loads. Use the same model name you pulled (e.g. `glm-4.7`).
+        **Description**: Run a quick prompt to confirm the model loads.
        ```bash
-        ollama run glm-4.7
+        ollama run qwen3.6:27b
        ```
        Try a prompt like:
@ -154,26 +164,36 @@ spec:
        Write a short README checklist for a Python project.
        ```
-        Expected output should show the model responding in the terminal.
+        **Expected output**: The model replies with a short README checklist.
        **Exit the Ollama REPL** when done: type `/bye` or press **Ctrl+D**.
        # Step 5. Install Claude Code
        **Description**: Install the CLI tool that will drive the local model.
        ```bash
-        curl -fsSL https://claude.ai/install.sh | sh
+        curl -fsSL https://claude.ai/install.sh | bash
        ```
        **Verify the installation**:
        ```bash
        claude --version
        ```
        **Expected output** (example): A version string such as `claude 0.x.x` or similar. If you see `claude: command not found`, ensure the install script added the CLI to your PATH (e.g. restart the terminal or source your shell profile); see [Troubleshooting](troubleshooting.md).
        # Step 6. Increase context length (optional)
        **Description**: Ollama defaults to a 4096-token context length. For coding agents and
        larger codebases, set it to 64K tokens. This increases memory usage.
-        For more details on configuring context length, see the [Ollama documentation](https://ollama.com/docs/faq#how-can-i-increase-the-context-length).
+        For more details on configuring context length and other parameters, see the Ollama documentation (context window and runtime options).
        Set the context length per session in the Ollama REPL:
        ```bash
-        ollama run glm-4.7
+        ollama run qwen3.6:27b
        ```
        Then, in the Ollama prompt:
@ -183,6 +203,8 @@ spec:
        ```
        **Exit when done**: type `/bye` or press **Ctrl+D**.
        Optional method (set globally when serving Ollama):
        ```bash
@ -194,16 +216,15 @@ spec:
        # Step 7. Connect Claude Code to Ollama
-        **Description**: Point Claude Code to the local Ollama server and launch it. Use the model you pulled (e.g. GLM-4.7 or GLM-4.7-Flash).
+        **Description**: Launch Claude Code through Ollama with the model you pulled.
        ```bash
-        export ANTHROPIC_AUTH_TOKEN=ollama
+        ollama launch claude --model qwen3.6:27b
        export ANTHROPIC_BASE_URL=http://localhost:11434
        claude --model glm-4.7
        ```
-        Expected output should show Claude Code starting and using the local model.
+        **Expected output**: Claude Code starts and uses the local Ollama model.
        **Exit Claude Code** when done: type `/exit` or press **Ctrl+C**.
        # Step 8. Complete a small coding task
@ -212,53 +233,58 @@ spec:
        ```bash
        mkdir -p ~/cli-agent-demo
        cd ~/cli-agent-demo
        python3 -m venv .venv
        source .venv/bin/activate
        python3 -m pip install -U pytest
        printf 'def add(a, b):\n    """Return the sum of a and b."""\n    pass\n' > math_utils.py
        printf 'import math_utils\n\n\ndef test_add():\n    assert math_utils.add(1, 2) == 3\n' > test_math_utils.py
        ```
-        If you do not already have pytest installed:
+        If Claude Code is not already running, launch it:
        ```bash
-        python -m pip install -U pytest
+        ollama launch claude --model qwen3.6:27b
        ```
-        In Claude Code:
+        In Claude Code, enter:
        ```text
        Please implement add() in math_utils.py and make sure the test passes.
        ```
-        Run the test:
+        **Exit Claude Code** when finished: type `/exit` or press **Ctrl+C**, then run the test:
        ```bash
-        python -m pytest -q
+        python3 -m pytest -q
        deactivate
        ```
        Expected output should show the test passing.
        # Step 9. Cleanup and rollback
-        **Description**: Remove the model and stop services if you no longer need them.
+        **Description**: Remove the model and stop the Ollama service if you no longer need them. **Remove the model first** (while the Ollama server is running), then stop the service.
-        To stop the service:
+        > [!WARNING]
        > The following removes the downloaded model files from disk.
        **1. Remove the model** (Ollama must be running). Use the same name you pulled:
        ```bash
        ollama rm qwen3.6:27b
        ```
        **2. Stop the Ollama service**:
        ```bash
        sudo systemctl stop ollama
        ```
        > [!WARNING]
        > This will delete the downloaded model files.
        ```bash
        ollama rm glm-4.7
        ```
        # Step 10. Next steps
-        - Use **GLM-4.7** or high-quality variants (`glm-4.7-flash:bf16`, `glm-4.7-flash:q8_0`) on GB300 Ultra for best quality
+        - Use larger context (e.g. 64K–198K) for big codebases.
-        - Use larger context (e.g. 64K–198K) for big codebases
+        - Use Claude Code on multi-file refactors or test-generation tasks.
        - Use Claude Code on multi-file refactors or test-generation tasks
@ -270,18 +296,19 @@ spec:
        | Symptom | Cause | Fix |
        |---------|-------|-----|
        | `ollama: command not found` | Ollama not installed or PATH not updated | Rerun `curl -fsSL https://ollama.com/install.sh \| sh` and open a new shell |
-        | Model load fails with version error | Ollama is older than 0.14.3 | Update Ollama to 0.14.3 or newer |
+        | Model load fails with version error | Ollama is older than the model requires | Update Ollama to a current stable release. Do not pin to older versions. |
-        | `model not found` in Claude Code | Model was not pulled | Run `ollama pull glm-4.7-flash` or `ollama pull glm-4.7` and retry |
+        | `model not found` in Claude Code | Model was not pulled | Run `ollama pull qwen3.6:27b` and retry with `ollama launch claude --model qwen3.6:27b`. |
-        | `opencode: command not found` | OpenCode not installed or PATH not updated | Install OpenCode and open a new shell |
+        | `connection refused` to localhost:11434 | Ollama service not running | Start with `ollama serve` or `sudo systemctl start ollama` |
-        | OpenCode cannot reach Ollama | `baseURL` misconfigured or Ollama not running | Set `baseURL` to `http://localhost:11434/v1` and start Ollama |
+        | Sharded GGUF model pull fails with HTTP 400 | Ollama does not support pulling sharded GGUF models from Hugging Face | Use the documented `qwen3.6:27b` model instead: `ollama pull qwen3.6:27b`. |
-        | `codex: command not found` | Codex CLI not installed or PATH not updated | Install Codex CLI and open a new shell |
+        | `CUDA error: context is destroyed` on a dual-GPU Station | Ollama may fail when both the GB300 and RTX PRO 6000 GPUs are visible | Run Ollama with one visible GPU. For example, set `CUDA_VISIBLE_DEVICES=1` in the Ollama service environment, restart Ollama, and rerun the playbook. |
-        | Codex CLI uses the wrong model/provider | `~/.codex/config.toml` not pointing to Ollama | Set `model_provider = "ollama"` and `base_url = "http://localhost:11434/v1"` |
+        | Claude Code edit task fails through the direct Ollama endpoint | Direct endpoint wiring can fail with some Ollama/model combinations | Launch Claude Code through Ollama instead: `ollama launch claude --model qwen3.6:27b`. |
-        | `connection refused` to localhost:11434 | Ollama service not running | Start with `ollama serve` or `systemctl start ollama` |
+        | `externally-managed-environment` or Python package install fails | System Python blocks direct package installs | Create and activate a virtual environment, then install pytest inside it: `python3 -m venv .venv`, `source .venv/bin/activate`, `python3 -m pip install -U pytest`. |
-        | Slow responses or OOM | Insufficient GPU memory or fragmentation | On DGX Station GB300 Ultra, ensure no other heavy GPU workloads. If OOM persists, use a smaller variant (e.g. `glm-4.7-flash:q8_0` or `glm-4.7-flash:q4_K_M`) or `OLLAMA_MAX_LOADED_MODELS=1`. |
+        | Slow responses or OOM | Insufficient GPU memory or fragmentation | On DGX Station (NVIDIA GB300), ensure no other heavy GPU workloads. If OOM persists, unload other models or set `OLLAMA_MAX_LOADED_MODELS=1`. |
        | `claude: command not found` after install | CLI not on PATH or install script did not complete | Restart the terminal or run `source ~/.bashrc` (or your shell profile). Check the install script output for the install path and add it to PATH. |
        | Claude Code install fails (Node.js / network) | Node.js missing or install script cannot download | Ensure Node.js is installed (`node --version`). Run the installer with Bash: `curl -fsSL https://claude.ai/install.sh \| bash`. If the install script fails with a network error, retry from a stable connection or download the Claude Code CLI from the official site. See [Claude Code documentation](https://claude.ai/docs) for alternatives. |
        > [!NOTE]
-        > DGX Station with GB300 Ultra provides ample GPU memory for **GLM-4.7** and **GLM-4.7-Flash** in high-quality
+        > DGX Station with **NVIDIA GB300** provides ample GPU memory for the documented `qwen3.6:27b` workflow. Use `OLLAMA_MAX_LOADED_MODELS=1` if you hit memory limits with multiple models.
        > variants (e.g. `glm-4.7-flash:bf16`). Use `OLLAMA_MAX_LOADED_MODELS=1` if you hit memory limits with multiple models.
@ -291,31 +318,11 @@ spec:
      url: https://ollama.com/docs
-    - name: GLM-4.7-Flash (Ollama)
+    - name: Qwen3.6 27B
-      url: https://ollama.com/library/glm-4.7-flash
+      url: https://ollama.com/library/qwen3.6
    - name: GLM-4.7 (Ollama)
      url: https://ollama.com/library/glm-4.7
    - name: Claude Code + Ollama Guide
      url: https://ollama.com/blog/claude
    - name: OpenCode Ollama Provider
      url: https://opencode.ai/docs/providers/#ollama
    - name: Codex + Ollama Guide
      url: https://ollama.com/blog/codex
    - name: DGX Station Documentation
      url: https://docs.nvidia.com/dgx/dgx-station
    - name: DGX Station Forum
      url: https://forums.developer.nvidia.com/c/accelerated-computing/dgx-station
--- a/nvidia/station-vllm/endpoint-production.yaml
+++ b/nvidia/station-vllm/endpoint-production.yaml
@ -65,6 +65,8 @@ spec:
        | Model | Quantization | Support Status | HF Handle |
        |-------|-------------|----------------|-----------|
        | **DiffusionGemma 26B A4B IT** | BF16 | ✅ | [`google/diffusiongemma-26B-A4B-it`](https://huggingface.co/google/diffusiongemma-26B-A4B-it) |
        | **DiffusionGemma 26B A4B IT** | NVFP4 | ✅ | [`nvidia/diffusiongemma-26B-A4B-it-NVFP4`](https://huggingface.co/nvidia/diffusiongemma-26B-A4B-it-NVFP4) |
        | **Step-3.7-Flash-FP8** | FP8 | ✅ | [`stepfun-ai/Step-3.7-Flash-FP8`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-FP8) |
        | **Step-3.7-Flash-NVFP4** | NVFP4 | ✅ | [`stepfun-ai/Step-3.7-Flash-NVFP4`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-NVFP4) |
        | **Qwen3-235B-A22B-NVFP4** | NVFP4 | ✅ | [`nvidia/Qwen3-235B-A22B-NVFP4`](https://huggingface.co/nvidia/Qwen3-235B-A22B-NVFP4) |
@ -74,7 +76,7 @@ spec:
        * **Duration:** 30 minutes (longer on first run due to model download)
        * **Risks:** Model download requires HuggingFace authentication
        * **Rollback:** Stop and remove the container to restore state
-        * **Last Updated:** 05/28/2026
+        * **Last Updated:** 06/10/2026
          * Update models
@ -117,6 +119,12 @@ spec:
        docker pull nvcr.io/nvidia/vllm:26.01-py3
        ```
        For DiffusionGemma, use the vLLM custom container:
        ```bash
        docker pull vllm/vllm-openai:gemma
        ```
        For Step-3.7-Flash models, pull the custom vLLM container:
        ```bash
        docker pull vllm/vllm-openai:stepfun37
@ -144,6 +152,34 @@ spec:
            --gpu-memory-utilization 0.9
        ```
        For DiffusionGemma models (e.g. `google/diffusiongemma-26B-A4B-it`), run with the custom vLLM container:
        ```bash
        docker run -d \
          --name vllm-server \
          -p 8000:8000 \
          --gpus all \
          --shm-size=16g \
          --ulimit memlock=-1 \
          --ulimit stack=67108864 \
          -e VLLM_USE_V2_MODEL_RUNNER=1 \
          -e HF_TOKEN="$HF_TOKEN" \
          vllm/vllm-openai:gemma ${MODEL_HANDLE} \
          --gpu-memory-utilization 0.85 \
          --attention-backend TRITON_ATTN \
          --max-num-seqs 16 \
          --diffusion-config '{"canvas_length":256}' \
          --override-generation-config '{"max_new_tokens": null}' \
          --load-format fastsafetensors \
          --enable-prefix-caching \
          --reasoning-parser gemma4 \
          --default-chat-template-kwargs '{"enable_thinking": true}' \
          --enable-auto-tool-choice \
          --tool-call-parser gemma4
        # For BF16 checkpoint add "--moe-backend triton" for better performance
        ```
        For Step-3.7-Flash models, run with the custom vLLM container. Both the FP8 and NVFP4 versions fit entirely in VRAM on the GB300.
        ```bash
--- a/nvidia/station-vllm/endpoint-test.yaml
+++ b/nvidia/station-vllm/endpoint-test.yaml
@ -70,6 +70,8 @@ spec:
        | **Step-3.7-Flash-FP8** | FP8 | ✅ | [`stepfun-ai/Step-3.7-Flash-FP8`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-FP8) |
        | **Step-3.7-Flash-NVFP4** | NVFP4 | ✅ | [`stepfun-ai/Step-3.7-Flash-NVFP4`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-NVFP4) |
        | **Qwen3-235B-A22B-NVFP4** | NVFP4 | ✅ | [`nvidia/Qwen3-235B-A22B-NVFP4`](https://huggingface.co/nvidia/Qwen3-235B-A22B-NVFP4) |
        | **Kimi-K2.5 (1T)** | NVFP4 | ✅ | [`nvidia/Kimi-K2.5-NVFP4`](https://huggingface.co/nvidia/Kimi-K2.5-NVFP4) |
        | **DeepSeek-V4-Flash** | NVFP4 | ✅ | [`deepseek-ai/DeepSeek-V4-Flash`](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash) |
        # Time & risk
@ -78,6 +80,7 @@ spec:
        * **Rollback:** Stop and remove the container to restore state
        * **Last Updated:** 06/10/2026
          * Update models
          * Add base configuration example, per-setting explanations, and DeepSeek-V4-Flash recipe
@ -130,11 +133,23 @@ spec:
        docker pull vllm/vllm-openai:stepfun37
        ```
        For Kimi-K2.5 NVFP4 (1T) with DRAM offloading, pull the **26.03** image, which includes the `--cpu-offload-params` support used below:
        ```bash
        docker pull nvcr.io/nvidia/vllm:26.03-py3
        ```
        For DeepSeek-V4-Flash, pull the stable DeepSeek-V4 release container. Use the **cu130** build on DGX Station (Blackwell):
        ```bash
        docker pull vllm/vllm-openai:v0.20.0-cu130
        ```
        # Step 4. Start vLLM server
        Start the vLLM server with the model. On a single-GPU DGX Station, `--gpus all` uses the GB300. If the system has multiple GPUs and you want to use only the GB300, replace `--gpus all` with `--gpus '"device=N"'`, where N is the GB300 device index reported by `nvidia-smi`.
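        One way to look up the device index is to filter the `nvidia-smi` query output by GPU name. The sketch below is illustrative only: the helper `find_gpu_index` is a hypothetical name, and it assumes the GB300 reports a product name containing the string `GB300`. Check the raw `nvidia-smi` output on your system before relying on the match.

        ```bash
        # Print the index of the first GPU whose name contains the given pattern.
        # Reads "index, name" CSV rows on stdin (the format produced by
        # nvidia-smi --query-gpu=index,name --format=csv,noheader).
        find_gpu_index() {
          awk -F', ' -v pat="$1" 'index($2, pat) { print $1; exit }'
        }

        # Only query the driver when nvidia-smi is available.
        if command -v nvidia-smi >/dev/null 2>&1; then
          GPU_ID="$(nvidia-smi --query-gpu=index,name --format=csv,noheader | find_gpu_index "GB300")"
          if [ -n "$GPU_ID" ]; then
            # Assumption: the name match above identified the GB300.
            echo "Use: --gpus '\"device=${GPU_ID}\"'"
          else
            echo "No GPU name matched; inspect nvidia-smi output manually."
          fi
        fi
        ```

        If the pattern matches nothing, the helper prints nothing, so the guard above falls back to a manual check rather than passing an empty device ID to Docker.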
-        For Qwen3-235B NVFP4 model, run with the NGC container. This model fits entirely in VRAM on the GB300.
+        ## Base configuration (most models)
        This is the recommended starting point for any model that fits entirely in VRAM on the GB300. The Qwen3-235B-A22B-NVFP4 model, for example, runs directly with this configuration.
        ```bash
        docker run -d \
@ -152,6 +167,12 @@ spec:
            --gpu-memory-utilization 0.9
        ```
        Settings used:
        - `--max-model-len` — maximum context length (prompt + output) per request. Larger values reserve more GPU memory for the KV cache; size it to your workload.
        - `--gpu-memory-utilization 0.9` — fraction of GPU memory vLLM may use for weights and KV cache. `0.9` leaves headroom for other processes; raise toward `0.95` to fit more KV cache if the GPU is dedicated.
        ## DiffusionGemma 26B A4B
        For DiffusionGemma models (e.g. `google/diffusiongemma-26B-A4B-it`), run with the custom vLLM container:
        ```bash
@ -180,6 +201,8 @@ spec:
        # For BF16 checkpoint add "--moe-backend triton" for better performance
        ```
        ## Step-3.7-Flash (FP8 / NVFP4)
        For Step-3.7-Flash models, run with the custom vLLM container. Both the FP8 and NVFP4 versions fit entirely in VRAM on the GB300.
        ```bash
@ -202,6 +225,94 @@ spec:
            --kv-cache-dtype fp8
        ```
        Settings used (in addition to the base configuration):
        - `--trust-remote-code` — allows the model's custom modeling code (shipped in its repo) to load. Required for Step-3.7.
        - `--reasoning-parser step3p5` — parses the model's reasoning/thinking tokens into the dedicated `reasoning_content` response field.
        - `--enable-auto-tool-choice` — lets the model decide when to call a tool, enabling OpenAI-compatible function calling.
        - `--tool-call-parser step3p5` — parses the model's tool-call output into structured `tool_calls`. Pairs with `--enable-auto-tool-choice`.
        - `--kv-cache-dtype fp8` — stores the KV cache in FP8, roughly halving KV-cache memory versus 16-bit and allowing more concurrent/longer sequences.
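        The "roughly halving" claim for the FP8 KV cache can be checked with back-of-the-envelope arithmetic. The sketch below assumes a standard attention layout in which each token stores one key and one value vector per layer and per KV head, and it ignores any scale-factor overhead. The helper name `kv_bytes_per_token` and the dimensions passed to it are hypothetical, chosen for illustration; they are not the real Step-3.7-Flash configuration.

        ```bash
        # Bytes of KV cache per token:
        #   2 tensors (K and V) x layers x KV heads x head dim x bytes per element.
        # Arguments: layers, kv_heads, head_dim, bytes_per_element.
        kv_bytes_per_token() {
          echo $(( 2 * $1 * $2 * $3 * $4 ))
        }

        # Hypothetical model dimensions, for illustration only.
        echo "16-bit KV cache: $(kv_bytes_per_token 48 8 128 2) bytes/token"
        echo "FP8 KV cache:    $(kv_bytes_per_token 48 8 128 1) bytes/token"
        ```

        Because the element size is the only factor that changes, the per-token footprint drops by half, which is why the same memory budget can hold about twice as many cached tokens.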
        ## Kimi-K2.5 NVFP4 (1T) — CPU offloading
        For Kimi-K2.5 NVFP4 (1T) with DRAM offloading, run with the **26.03** NGC container. This model does not fit entirely in VRAM, so the MoE expert weights are offloaded to CPU DRAM with `--cpu-offload-gb 375 --cpu-offload-params experts`. Ensure the system has enough free DRAM to hold the offloaded weights.
        ```bash
        docker run -d \
          --name vllm-server \
          --gpus all \
          --ipc host \
          --ulimit memlock=-1 \
          --ulimit stack=67108864 \
          -p 8000:8000 \
          -e HF_TOKEN="$HF_TOKEN" \
          -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
          nvcr.io/nvidia/vllm:26.03-py3 \
          vllm serve nvidia/Kimi-K2.5-NVFP4 \
            --host 0.0.0.0 \
            --port 8000 \
            --dtype auto \
            --kv-cache-dtype auto \
            --gpu-memory-utilization 0.95 \
            --served-model-name nvidia/Kimi-K2.5-NVFP4 \
            --tensor-parallel-size 1 \
            --no-enable-prefix-caching \
            --trust-remote-code \
            --max-model-len 40960 \
            --max-num-seqs 1 \
            --max-num-batched-tokens 32768 \
            --cpu-offload-gb 375 \
            --cpu-offload-params experts
        ```
        Settings used (in addition to the base configuration):
        - `--cpu-offload-gb 375` — amount of CPU DRAM (in GiB) vLLM may use to hold weights that don't fit in VRAM. Must be large enough for the offloaded experts; the system needs at least this much free DRAM.
        - `--cpu-offload-params experts` — offloads only the MoE expert weights (the bulk of a large MoE model) to DRAM, keeping attention and other hot weights in VRAM.
        - `--tensor-parallel-size 1` — single GPU; the GB300 serves the whole model.
        - `--max-num-seqs 1` / `--max-num-batched-tokens 32768` — caps concurrency to one sequence and the batch token budget. With expert weights paged from DRAM, throughput is offload-bound, so a low concurrency keeps latency predictable.
        - `--no-enable-prefix-caching` — disables prefix-cache reuse. Offloaded experts make the memory budget tight, so the cache is turned off here rather than spent on KV reuse.
        - `--kv-cache-dtype auto` / `--dtype auto` — let vLLM pick the KV-cache and compute dtypes from the model's quantization (NVFP4).
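        Since the offload budget must fit in free DRAM, it can help to check available memory before starting the container. The following is a minimal sketch, assuming a Linux host that exposes `MemAvailable` in `/proc/meminfo`. The helper `has_enough_dram` is a hypothetical name, and the check does not account for memory that other processes may claim after it runs.

        ```bash
        # Succeed if available memory (in KiB) covers the requested offload size (in GiB).
        # Arguments: required_gib, available_kib.
        has_enough_dram() {
          required_kib=$(( $1 * 1024 * 1024 ))
          [ "$2" -ge "$required_kib" ]
        }

        if [ -r /proc/meminfo ]; then
          # MemAvailable is reported in KiB; default to 0 if the field is missing.
          avail_kib="$(awk '/^MemAvailable:/ { print $2 }' /proc/meminfo)"
          avail_kib="${avail_kib:-0}"
          if has_enough_dram 375 "$avail_kib"; then
            echo "OK: enough available DRAM for --cpu-offload-gb 375"
          else
            echo "WARNING: only $(( avail_kib / 1024 / 1024 )) GiB available; 375 GiB requested"
          fi
        fi
        ```

        If the check warns, free up memory or reduce the offload size before launching, since the server needs the full offload budget in addition to its other host allocations.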
        ## DeepSeek-V4-Flash — MTP + agentic
        For DeepSeek-V4-Flash, run with the stable **v0.20.0-cu130** container. This recipe targets agentic workloads and enables Multi-Token Prediction (MTP) speculative decoding. On a single GB300 (TP1) the MoE expert-parallel path is used; the `deep_gemm_mega_moe` backend from some internal recipes is not needed at TP1 and is omitted here.
        ```bash
        docker run -d \
          --name vllm-server \
          --gpus all \
          --ipc host \
          --ulimit memlock=-1 \
          --ulimit stack=67108864 \
          -p 8000:8000 \
          -e HF_TOKEN="$HF_TOKEN" \
          -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
          vllm/vllm-openai:v0.20.0-cu130 \
          deepseek-ai/DeepSeek-V4-Flash \
            --enable-expert-parallel \
            --kv-cache-dtype fp8 \
            --trust-remote-code \
            --block-size 256 \
            --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' \
            --attention_config.use_fp4_indexer_cache True \
            --tokenizer-mode deepseek_v4 \
            --tool-call-parser deepseek_v4 \
            --enable-auto-tool-choice \
            --reasoning-parser deepseek_v4 \
            --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}' \
            --max-model-len 32768
        ```
        Settings used (in addition to the base configuration):
        - `--enable-expert-parallel` — shards the MoE experts across the available GPU(s) using expert parallelism, the recommended MoE execution path for DeepSeek-V4.
        - `--speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'` — enables **MTP (Multi-Token Prediction)** speculative decoding: the model proposes 3 tokens per step that are verified in a single forward pass, cutting latency for accepted tokens.
        - `--kv-cache-dtype fp8` — FP8 KV cache to fit more concurrent/longer sequences.
        - `--block-size 256` — KV-cache page size in tokens. DeepSeek-V4 uses multiple KV-cache groups; `256` matches the recipe validated on Station.
        - `--attention_config.use_fp4_indexer_cache True` — enables the FP4 indexer cache used by DeepSeek-V4's attention. (Drop this flag on platforms without native FP4, e.g. Hopper.)
        - `--tokenizer-mode deepseek_v4` / `--tool-call-parser deepseek_v4` / `--reasoning-parser deepseek_v4` — DeepSeek-V4-specific tokenizer, tool-call, and reasoning parsers.
        - `--enable-auto-tool-choice` — OpenAI-compatible function calling for agentic use.
        - `--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'` — uses full + piecewise CUDA graph capture and enables all custom ops for lower per-step overhead.
        - **Prefix caching is left enabled (the vLLM default).** For agentic workloads with large shared prefixes (e.g. a 32k system/context prefix) at low batch sizes (~BS 3–4), prefix caching gives a significant throughput boost by reusing the cached prefix across requests. Some internal recipes carry `--no-enable-prefix-caching`, but that was inherited from random-data benchmarking and is not recommended for agentic use here.
        Check the server logs for startup progress:
        ```bash