kind: Playbook
metadata:
  name: station-local-coding-agent
  displayName: Local Coding Agent
  shortDescription: Run local CLI coding agents with Ollama on DGX Station (GB300 Ultra) using GLM-4.7 and GLM-4.7-Flash
  publisher: nvidia
  description: |
    # REPLACE THIS WITH YOUR MODEL CARD
    https://gitlab-master.nvidia.com/api-catalog/examples/-/blob/main/modelcard-example-mixtral8x7b.md?ref_type=heads
  labelsV2:
    - gpuType:playbook:gpu_type_station
    - DGX Station
    - GB300
    - Coding
    - LLM
    - Ollama
    - Claude Code
    - OpenCode
    - Codex
  attributes:
    - key: DURATION
      value: 30 MINS
spec:
  artifactName: station-local-coding-agent
  nvcfFunctionId: None
  attributes:
    showUnavailableBanner: false
    apiDocsUrl: None
    termsOfUse: |
  tabs:
    - id: overview
      label: Overview
      content: |
        # Basic idea

        Use Ollama on **DGX Station with GB300 Ultra** to run local coding models and connect a CLI coding agent. This playbook supports three options: **Claude Code**, **OpenCode**, and **Codex CLI**. Each agent talks to Ollama for local inference, so you can work without external cloud APIs.

        The GB300 Ultra’s massive GPU memory lets you run **GLM-4.7** and **GLM-4.7-Flash** in high-quality variants (e.g. bf16, q8_0) for the best coding-assistant quality directly on the Station.

        # Choose your CLI agent

        Pick the tab that matches the CLI agent you want to use:

        - **Claude Code**: Fastest path to a working CLI agent with a local Ollama model.
        - **OpenCode**: Open-source CLI with provider configuration; this guide targets Ollama.
        - **Codex CLI**: OpenAI Codex CLI configured to run against Ollama locally.

        # What you'll accomplish

        You will run a local coding model on your **DGX Station (GB300 Ultra)** with Ollama, connect it to your chosen CLI agent, and complete a small coding task end-to-end. You can use **GLM-4.7** or **GLM-4.7-Flash** (including high-quality variants) to take full advantage of the Station’s memory.

        # What to know before starting

        - Comfort with Linux command line basics
        - Experience running terminal-based tools and editors
        - Familiarity with Python for the short coding task

        # Prerequisites

        - **DGX Station** with **GB300 Ultra** (Grace Blackwell) and NVIDIA driver
        - Internet access to download model weights
        - Ollama 0.14.3 or newer
        - **GPU memory** on GB300 Ultra supports GLM-4.7 and high-quality variants:
          - **GLM-4.7-Flash** (30B): ~19GB (latest) to ~60GB (bf16); recommended default for coding
          - **GLM-4.7** (full): use `ollama pull glm-4.7` for higher quality when available
          - High-quality variants (e.g. `glm-4.7-flash:bf16`, `glm-4.7-flash:q8_0`) fit comfortably on GB300 Ultra

        # Time & risk

        * **Duration**: ~20–30 minutes (includes model download)
        * **Risk level**: Low
          * Large model downloads can fail if network connectivity is unstable
          * Older Ollama versions will not load newer models
        * **Rollback**: Stop Ollama and delete the downloaded model from `~/.ollama/models`
        * **Last Updated:** February 2025
          * Tailored for DGX Station with GB300 Ultra; added large-model recommendations

    - id: claude-code
      label: Claude Code
      content: |
        # Step 1. Confirm your environment

        **Description**: Verify the GPU is visible before installing anything.

        ```bash
        nvidia-smi
        ```

        Expected output should show a detected GPU (e.g. GB300 Ultra).

        # Step 2. Install or update Ollama

        **Description**: Install Ollama or ensure it is recent enough for modern coding models.

        ```bash
        curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.14.3 sh
        ollama --version
        ```

        If Ollama is already installed, skip the install and check the version only:

        ```bash
        ollama --version
        ```

        Expected output should show version 0.14.3 or newer.

        # Step 3. Pull a coding model

        **Description**: Download the model weights to your DGX Station. This playbook uses **GLM-4.7** where available.

        **Recommended: GLM-4.7**:

        ```bash
        ollama pull glm-4.7
        ```

        **High-quality variants** on GB300 Ultra (use more GPU memory for better quality):

        ```bash
        ollama pull glm-4.7-flash:q8_0
        ollama pull glm-4.7-flash:bf16
        ```

        Expected output should show your model in `ollama list`.

        # Step 4. Test local inference

        **Description**: Run a quick prompt to confirm the model loads. Use the same model name you pulled (e.g. `glm-4.7`).

        ```bash
        ollama run glm-4.7
        ```

        Try a prompt like:

        ```text
        Write a short README checklist for a Python project.
        ```

        Expected output should show the model responding in the terminal.

        # Step 5. Install Claude Code

        **Description**: Install the CLI tool that will drive the local model.

        ```bash
        curl -fsSL https://claude.ai/install.sh | bash
        ```

        # Step 6. Increase context length (optional)

        **Description**: Ollama defaults to a 4096-token context length. For coding agents and larger codebases, set it to 64K tokens. This increases memory usage. For more details on configuring context length, see the [Ollama documentation](https://ollama.com/docs/faq#how-can-i-increase-the-context-length).

        Set the context length per session in the Ollama REPL:

        ```bash
        ollama run glm-4.7
        ```

        Then, in the Ollama prompt:

        ```text
        /set parameter num_ctx 64000
        ```

        Optional method (set globally when serving Ollama):

        ```bash
        sudo systemctl stop ollama
        OLLAMA_CONTEXT_LENGTH=64000 ollama serve
        ```

        Keep this terminal open and run the next step in a new terminal.

        # Step 7. Connect Claude Code to Ollama

        **Description**: Point Claude Code to the local Ollama server and launch it. Use the model you pulled (e.g. GLM-4.7 or GLM-4.7-Flash).

        ```bash
        export ANTHROPIC_AUTH_TOKEN=ollama
        export ANTHROPIC_BASE_URL=http://localhost:11434
        claude --model glm-4.7
        ```

        Expected output should show Claude Code starting and using the local model.

        # Step 8. Complete a small coding task

        **Description**: Create a tiny repo and let Claude Code implement a function and tests.

        ```bash
        mkdir -p ~/cli-agent-demo
        cd ~/cli-agent-demo
        printf 'def add(a, b):\n    """Return the sum of a and b."""\n    pass\n' > math_utils.py
        printf 'import math_utils\n\n\ndef test_add():\n    assert math_utils.add(1, 2) == 3\n' > test_math_utils.py
        ```

        If you do not already have pytest installed:

        ```bash
        python -m pip install -U pytest
        ```

        In Claude Code:

        ```text
        Please implement add() in math_utils.py and make sure the test passes.
        ```

        Run the test:

        ```bash
        python -m pytest -q
        ```

        Expected output should show the test passing.

        # Step 9. Cleanup and rollback

        **Description**: Remove the model and stop services if you no longer need them.

        Remove the model first, because `ollama rm` needs the Ollama server to be running:

        > [!WARNING]
        > This will delete the downloaded model files.

        ```bash
        ollama rm glm-4.7
        ```

        Then stop the service:

        ```bash
        sudo systemctl stop ollama
        ```

        # Step 10. Next steps

        - Use **GLM-4.7** or high-quality variants (`glm-4.7-flash:bf16`, `glm-4.7-flash:q8_0`) on GB300 Ultra for best quality
        - Use larger context (e.g. 64K–198K) for big codebases
        - Use Claude Code on multi-file refactors or test-generation tasks

    - id: troubleshooting
      label: Troubleshooting
      content: |
        | Symptom | Cause | Fix |
        |---------|-------|-----|
        | `ollama: command not found` | Ollama not installed or PATH not updated | Rerun `curl -fsSL https://ollama.com/install.sh \| sh` and open a new shell |
        | Model load fails with version error | Ollama is older than 0.14.3 | Update Ollama to 0.14.3 or newer |
        | `model not found` in Claude Code | Model was not pulled | Run `ollama pull glm-4.7-flash` or `ollama pull glm-4.7` and retry |
        | `opencode: command not found` | OpenCode not installed or PATH not updated | Install OpenCode and open a new shell |
        | OpenCode cannot reach Ollama | `baseURL` misconfigured or Ollama not running | Set `baseURL` to `http://localhost:11434/v1` and start Ollama |
        | `codex: command not found` | Codex CLI not installed or PATH not updated | Install Codex CLI and open a new shell |
        | Codex CLI uses the wrong model/provider | `~/.codex/config.toml` not pointing to Ollama | Set `model_provider = "ollama"` and `base_url = "http://localhost:11434/v1"` |
        | `connection refused` to localhost:11434 | Ollama service not running | Start with `ollama serve` or `sudo systemctl start ollama` |
        | Slow responses or OOM | Insufficient GPU memory or fragmentation | On DGX Station GB300 Ultra, ensure no other heavy GPU workloads are running. If OOM persists, use a smaller variant (e.g. `glm-4.7-flash:q8_0` or `glm-4.7-flash:q4_K_M`) or set `OLLAMA_MAX_LOADED_MODELS=1`. |

        > [!NOTE]
        > DGX Station with GB300 Ultra provides ample GPU memory for **GLM-4.7** and **GLM-4.7-Flash** in high-quality
        > variants (e.g. `glm-4.7-flash:bf16`). Use `OLLAMA_MAX_LOADED_MODELS=1` if you hit memory limits with multiple models.

  resources:
    - name: Ollama Documentation
      url: https://ollama.com/docs
    - name: GLM-4.7-Flash (Ollama)
      url: https://ollama.com/library/glm-4.7-flash
    - name: GLM-4.7 (Ollama)
      url: https://ollama.com/library/glm-4.7
    - name: Claude Code + Ollama Guide
      url: https://ollama.com/blog/claude
    - name: OpenCode Ollama Provider
      url: https://opencode.ai/docs/providers/#ollama
    - name: Codex + Ollama Guide
      url: https://ollama.com/blog/codex
    - name: DGX Station Documentation
      url: https://docs.nvidia.com/dgx/dgx-station
    - name: DGX Station Forum
      url: https://forums.developer.nvidia.com/c/accelerated-computing/dgx-station