dgx-spark-playbooks/nvidia/station-local-coding-agent/endpoint-test.yaml
2026-06-11 01:07:29 +00:00

kind: Playbook
metadata:
name: station-local-coding-agent
displayName: Local Coding Agent
shortDescription: Run local CLI coding agents with Ollama on DGX Station (GB300 Ultra) using GLM-4.7 and GLM-4.7-Flash
publisher: nvidia
description: |
# REPLACE THIS WITH YOUR MODEL CARD
https://gitlab-master.nvidia.com/api-catalog/examples/-/blob/main/modelcard-example-mixtral8x7b.md?ref_type=heads
labelsV2:
- gpuType:playbook:gpu_type_station
- DGX Station
- GB300
- Coding
- LLM
- Ollama
- Claude Code
- OpenCode
- Codex
attributes:
- key: DURATION
value: 30 MINS
spec:
artifactName: station-local-coding-agent
nvcfFunctionId: None
attributes:
showUnavailableBanner: false
apiDocsUrl: None
termsOfUse: |
tabs:
-
id: overview
label: Overview
content: |
# Basic idea
Use Ollama on **DGX Station with GB300 Ultra** to run local coding models and connect a CLI coding agent. This
playbook supports three options: **Claude Code**, **OpenCode**, and **Codex CLI**. Each
agent talks to Ollama for local inference, so you can work without external cloud APIs.
The GB300 Ultra's massive GPU memory lets you run **GLM-4.7** and **GLM-4.7-Flash** in high-quality variants (e.g. bf16, q8_0) for the best coding-assistant quality directly on the Station.
# Choose your CLI agent
Pick the tab that matches the CLI agent you want to use:
- **Claude Code**: Fastest path to a working CLI agent with a local Ollama model.
- **OpenCode**: Open-source CLI with provider configuration; this guide targets Ollama.
- **Codex CLI**: OpenAI Codex CLI configured to run against Ollama locally.
# What you'll accomplish
You will run a local coding model on your **DGX Station (GB300 Ultra)** with Ollama, connect it to your
chosen CLI agent, and complete a small coding task end-to-end. You can use **GLM-4.7** or **GLM-4.7-Flash** (including high-quality variants) to take full advantage of the Station's memory.
# What to know before starting
- Comfort with Linux command line basics
- Experience running terminal-based tools and editors
- Familiarity with Python for the short coding task
# Prerequisites
- **DGX Station** with **GB300 Ultra** (Grace Blackwell) and NVIDIA driver
- Internet access to download model weights
- Ollama 0.14.3 or newer
- **GPU memory** on GB300 Ultra supports GLM-4.7 and high-quality variants:
- **GLM-4.7-Flash** (30B): ~19GB (latest) to ~60GB (bf16) — recommended default for coding
- **GLM-4.7** (full): use `ollama pull glm-4.7` for higher quality when available
- High-quality variants (e.g. `glm-4.7-flash:bf16`, `glm-4.7-flash:q8_0`) fit comfortably on GB300 Ultra
# Time & risk
* **Duration**: ~20-30 minutes (includes model download)
* **Risk level**: Low
* Large model downloads can fail if network connectivity is unstable
* Older Ollama versions will not load newer models
* **Rollback**: Stop Ollama and delete the downloaded model from `~/.ollama/models`
* **Last Updated:** February 2025
* Tailored for DGX Station with GB300 Ultra; added large-model recommendations
-
id: claude-code
label: Claude Code
content: |
# Step 1. Confirm your environment
**Description**: Verify the GPU is visible before installing anything.
```bash
nvidia-smi
```
Expected output should show a detected GPU (e.g. GB300 Ultra).
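For a scriptable version of this check, a minimal sketch follows. It assumes the installed `nvidia-smi` supports the standard `--query-gpu` CSV output and that the reported GPU name contains `GB300`. The `gpu_matches` helper is hypothetical and not part of the playbook.

```bash
# Hypothetical helper: succeeds when any line on stdin contains the
# pattern in $1 (case-insensitive). Intended for nvidia-smi CSV output.
gpu_matches() {
  grep -qi -- "$1"
}

if command -v nvidia-smi >/dev/null 2>&1; then
  # One CSV line per GPU: name and total memory as reported by the driver.
  gpus="$(nvidia-smi --query-gpu=name,memory.total --format=csv,noheader)"
  echo "$gpus"
  # Assumption: the GPU name reported by the driver contains "GB300".
  if echo "$gpus" | gpu_matches "GB300"; then
    echo "GB300-class GPU detected"
  else
    echo "No GPU name containing GB300; check the driver installation"
  fi
else
  echo "nvidia-smi not found; install the NVIDIA driver first"
fi
```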
# Step 2. Install or update Ollama
**Description**: Install Ollama or ensure it is recent enough for modern coding models.
```bash
curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.14.3 sh
ollama --version
```
If Ollama is already installed and its version is 0.14.3 or newer, skip the installer and just confirm the version:
```bash
ollama --version
```
The output of `ollama --version` should report 0.14.3 or newer.
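The version comparison can also be scripted. The sketch below assumes GNU `sort -V` is available and that `ollama --version` prints a dotted version number somewhere in its output. The `version_ge` helper is hypothetical.

```bash
# Hypothetical helper: succeeds when version $1 is greater than or
# equal to version $2, using GNU sort's version ordering.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

required="0.14.3"
if command -v ollama >/dev/null 2>&1; then
  # Assumption: the first x.y.z token in the output is the installed version.
  installed="$(ollama --version 2>/dev/null | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1)"
  if [ -n "$installed" ] && version_ge "$installed" "$required"; then
    echo "Ollama $installed meets the $required minimum"
  else
    echo "Ollama ${installed:-unknown} is older than $required; rerun the installer"
  fi
else
  echo "ollama not found on PATH"
fi
```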
# Step 3. Pull a coding model
**Description**: Download the model weights to your DGX Station. This playbook uses **GLM-4.7** where available.
**Recommended: GLM-4.7**:
```bash
ollama pull glm-4.7
```
**High-quality variants** on GB300 Ultra (use more GPU memory for better quality):
```bash
ollama pull glm-4.7-flash:q8_0
ollama pull glm-4.7-flash:bf16
```
Expected output should show your model in `ollama list`.
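To check for the model in a script rather than by eye, a sketch like the following may help. It assumes `ollama list` prints a header row followed by one model per line with the name in the first column, and that an untagged name resolves to the `latest` tag. The `model_in_list` helper is hypothetical.

```bash
# Hypothetical helper: succeeds when the model named in $1 appears in
# the `ollama list` output supplied on stdin.
model_in_list() {
  local want="$1"
  # Assumption: a name without a tag refers to the :latest tag.
  case "$want" in
    *:*) ;;
    *) want="$want:latest" ;;
  esac
  # Skip the header row and compare the first column exactly.
  awk -v want="$want" 'NR > 1 && $1 == want { found = 1 } END { exit (found ? 0 : 1) }'
}

if command -v ollama >/dev/null 2>&1; then
  if ollama list 2>/dev/null | model_in_list "glm-4.7"; then
    echo "glm-4.7 is available locally"
  else
    echo "glm-4.7 not found; run: ollama pull glm-4.7"
  fi
fi
```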
# Step 4. Test local inference
**Description**: Run a quick prompt to confirm the model loads. Use the same model name you pulled (e.g. `glm-4.7`).
```bash
ollama run glm-4.7
```
Try a prompt like:
```text
Write a short README checklist for a Python project.
```
Expected output should show the model responding in the terminal.
# Step 5. Install Claude Code
**Description**: Install the CLI tool that will drive the local model.
```bash
curl -fsSL https://claude.ai/install.sh | bash
```
# Step 6. Increase context length (optional)
**Description**: Ollama defaults to a 4096 token context length. For coding agents and
larger codebases, set it to 64K tokens. This increases memory usage.
For more details on configuring context length, see the [Ollama documentation](https://ollama.com/docs/faq#how-can-i-increase-the-context-length).
Set the context length per session in the Ollama REPL:
```bash
ollama run glm-4.7
```
Then, in the Ollama prompt:
```text
/set parameter num_ctx 64000
```
Optional method (set globally when serving Ollama):
```bash
sudo systemctl stop ollama
OLLAMA_CONTEXT_LENGTH=64000 ollama serve
```
Keep this terminal open and run the next step in a new terminal.
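For one-off requests, the context length can also be passed per call through the `options.num_ctx` field of Ollama's `/api/generate` endpoint. The sketch below is illustrative only: the helper names are hypothetical, and the prompt is assumed to contain no quotes or backslashes because no JSON escaping is done.

```bash
# Hypothetical helper: print a JSON body for Ollama's /api/generate endpoint.
# $1 = model, $2 = prompt (assumed free of quotes and backslashes), $3 = num_ctx
generate_payload() {
  printf '{"model":"%s","prompt":"%s","stream":false,"options":{"num_ctx":%d}}\n' "$1" "$2" "$3"
}

# Hypothetical helper: send one non-streaming request with a 64000-token context.
# curl reads the request body from stdin via "-d @-".
ask_with_context() {
  generate_payload "$1" "$2" 64000 |
    curl -fsS --max-time 600 http://localhost:11434/api/generate -d @-
}
```

With Ollama running, a call such as `ask_with_context glm-4.7 "Explain what a context window is."` returns a single JSON response.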
# Step 7. Connect Claude Code to Ollama
**Description**: Point Claude Code to the local Ollama server and launch it. Use the model you pulled (e.g. GLM-4.7 or GLM-4.7-Flash).
```bash
export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_BASE_URL=http://localhost:11434
claude --model glm-4.7
```
Expected output should show Claude Code starting and using the local model.
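If you prefer not to leave the `ANTHROPIC_*` variables exported in your shell, one possible approach is a small wrapper. This is a sketch: `claude_local` is a hypothetical name, and it assumes `curl` is installed and that the Ollama server answers on `/api/version`.

```bash
# Hypothetical wrapper: start Claude Code against a local Ollama server.
# $1 = model name, $2 = optional base URL (defaults to the address used above)
claude_local() {
  local model="$1"
  local base_url="${2:-http://localhost:11434}"

  # Pre-flight check: fail fast if the server is not reachable.
  if ! curl -fsS --max-time 3 "$base_url/api/version" >/dev/null 2>&1; then
    echo "Ollama is not reachable at $base_url" >&2
    return 1
  fi

  # The parentheses run claude in a subshell, so the exports do not
  # leak into the calling shell.
  (
    export ANTHROPIC_AUTH_TOKEN=ollama
    export ANTHROPIC_BASE_URL="$base_url"
    claude --model "$model"
  )
}
```

Usage: `claude_local glm-4.7`. The function returns a non-zero status without launching Claude Code when the server is down.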
# Step 8. Complete a small coding task
**Description**: Create a tiny repo and let Claude Code implement a function and tests.
```bash
mkdir -p ~/cli-agent-demo
cd ~/cli-agent-demo
printf 'def add(a, b):\n    """Return the sum of a and b."""\n    pass\n' > math_utils.py
printf 'import math_utils\n\n\ndef test_add():\n    assert math_utils.add(1, 2) == 3\n' > test_math_utils.py
```
If you do not already have pytest installed:
```bash
python -m pip install -U pytest
```
In Claude Code:
```text
Please implement add() in math_utils.py and make sure the test passes.
```
Run the test:
```bash
python -m pytest -q
```
Expected output should show the test passing.
# Step 9. Cleanup and rollback
**Description**: Remove the model and stop the service if you no longer need them. Remove the model first, because `ollama rm` needs a running Ollama server.
> [!WARNING]
> This will delete the downloaded model files.
```bash
ollama rm glm-4.7
```
To stop the service:
```bash
sudo systemctl stop ollama
```
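To confirm that disk space was reclaimed, a sketch like the one below reports the size of the model directories. `~/.ollama/models` is the path named in this playbook. The second path is an assumption about where a systemd-managed Linux install stores models and may not apply to your system. The `report_model_dirs` helper is hypothetical.

```bash
# Hypothetical helper: print disk usage for each directory argument
# that exists, and note the ones that do not.
report_model_dirs() {
  for dir in "$@"; do
    if [ -d "$dir" ]; then
      du -sh "$dir"
    else
      echo "not present: $dir"
    fi
  done
}

# First path: from this playbook. Second path: an assumption about
# systemd-managed installs; reading it may require elevated permissions.
report_model_dirs "$HOME/.ollama/models" "/usr/share/ollama/.ollama/models"
```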
# Step 10. Next steps
- Use **GLM-4.7** or high-quality variants (`glm-4.7-flash:bf16`, `glm-4.7-flash:q8_0`) on GB300 Ultra for best quality
- Use larger context (e.g. 64K to 198K) for big codebases
- Use Claude Code on multi-file refactors or test-generation tasks
-
id: troubleshooting
label: Troubleshooting
content: |
| Symptom | Cause | Fix |
|---------|-------|-----|
| `ollama: command not found` | Ollama not installed or PATH not updated | Rerun `curl -fsSL https://ollama.com/install.sh \| sh` and open a new shell |
| Model load fails with version error | Ollama is older than 0.14.3 | Update Ollama to 0.14.3 or newer |
| `model not found` in Claude Code | Model was not pulled | Run `ollama pull glm-4.7-flash` or `ollama pull glm-4.7` and retry |
| `opencode: command not found` | OpenCode not installed or PATH not updated | Install OpenCode and open a new shell |
| OpenCode cannot reach Ollama | `baseURL` misconfigured or Ollama not running | Set `baseURL` to `http://localhost:11434/v1` and start Ollama |
| `codex: command not found` | Codex CLI not installed or PATH not updated | Install Codex CLI and open a new shell |
| Codex CLI uses the wrong model/provider | `~/.codex/config.toml` not pointing to Ollama | Set `model_provider = "ollama"` and `base_url = "http://localhost:11434/v1"` |
| `connection refused` to localhost:11434 | Ollama service not running | Start with `ollama serve` or `sudo systemctl start ollama` |
| Slow responses or OOM | Insufficient GPU memory or fragmentation | On DGX Station GB300 Ultra, ensure no other heavy GPU workloads are running. If OOM persists, use a smaller variant (e.g. `glm-4.7-flash:q8_0` or `glm-4.7-flash:q4_K_M`) or set `OLLAMA_MAX_LOADED_MODELS=1`. |
> [!NOTE]
> DGX Station with GB300 Ultra provides ample GPU memory for **GLM-4.7** and **GLM-4.7-Flash** in high-quality
> variants (e.g. `glm-4.7-flash:bf16`). Use `OLLAMA_MAX_LOADED_MODELS=1` if you hit memory limits with multiple models.
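
Several rows in the table above come down to a CLI not being on `PATH`. A quick way to check all of them at once is sketched below. The `missing_tools` helper is hypothetical, and the tool names are the commands referenced in the table.

```bash
# Hypothetical helper: print each named command that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

missing="$(missing_tools ollama claude opencode codex)"
if [ -n "$missing" ]; then
  # Unquoted on purpose so the names print on one line.
  echo "Not on PATH:" $missing
else
  echo "All CLI tools found"
fi
```

Only the agent you chose needs to be installed, so a missing entry for an agent you are not using can be ignored.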
resources:
- name: Ollama Documentation
url: https://ollama.com/docs
- name: GLM-4.7-Flash (Ollama)
url: https://ollama.com/library/glm-4.7-flash
- name: GLM-4.7 (Ollama)
url: https://ollama.com/library/glm-4.7
- name: Claude Code + Ollama Guide
url: https://ollama.com/blog/claude
- name: OpenCode Ollama Provider
url: https://opencode.ai/docs/providers/#ollama
- name: Codex + Ollama Guide
url: https://ollama.com/blog/codex
- name: DGX Station Documentation
url: https://docs.nvidia.com/dgx/dgx-station
- name: DGX Station Forum
url: https://forums.developer.nvidia.com/c/accelerated-computing/dgx-station