mirror of
https://github.com/NVIDIA/dgx-spark-playbooks.git
synced 2026-06-18 20:42:20 +00:00
323 lines
12 KiB
Markdown
323 lines
12 KiB
Markdown
# Local Coding Agent
|
||
|
||
> Run local CLI coding agents with Ollama on DGX Station (NVIDIA GB300) using glm-4.7-flash (fast) or unsloth/GLM-4.7-GGUF:Q8_0 (best quality)
|
||
|
||
|
||
## Table of Contents
|
||
|
||
- [Overview](#overview)
|
||
- [Claude Code](#claude-code)
|
||
- [Troubleshooting](#troubleshooting)
|
||
|
||
---
|
||
|
||
## Overview
|
||
|
||
## Basic idea
|
||
|
||
Use Ollama on **DGX Station (NVIDIA GB300)** to run local coding models and connect a CLI coding agent. This
|
||
playbook uses **Claude Code** to talk to Ollama for local inference, so you can work without external cloud APIs.
|
||
|
||
The DGX Station GPU (reported as **NVIDIA GB300** in `nvidia-smi`) provides ample memory to run **glm-4.7-flash** (fast loading and testing) and larger models such as **unsloth/GLM-4.7-GGUF:Q8_0** (best quality), both supported on Ollama.
|
||
|
||
## CLI agent
|
||
|
||
This playbook uses **Claude Code** as the CLI agent, connected to a local Ollama model for inference.
|
||
|
||
## What you'll accomplish
|
||
|
||
You will run a local coding model on your **DGX Station (NVIDIA GB300)** with Ollama, connect Claude Code to it, and complete a small coding task end-to-end. Use **glm-4.7-flash** (including high-quality variants) or **unsloth/GLM-4.7-GGUF:Q8_0** for best quality.
|
||
|
||
## What to know before starting
|
||
|
||
- Comfort with Linux command line basics
|
||
- Experience running terminal-based tools and editors
|
||
- Familiarity with Python for the short coding task
|
||
|
||
## Prerequisites
|
||
|
||
- **DGX Station** with **NVIDIA GB300** (Grace Blackwell) and NVIDIA driver; `nvidia-smi` typically shows "NVIDIA GB300"
|
||
- Internet access to download model weights
|
||
- **Ollama 0.15.0 or newer** (required for GLM-4.7-Flash; do not pin to 0.14.3)
|
||
- **GPU memory** on GB300 supports both recommended models:
|
||
- **glm-4.7-flash**: ~19 GB (`latest`) to ~60 GB (bf16) — **recommended for fast loading and testing**
|
||
- **unsloth/GLM-4.7-GGUF:Q8_0** (Hugging Face on Ollama): larger model — **recommended for best quality**
|
||
- Other variants (e.g. `glm-4.7-flash:bf16`, `glm-4.7-flash:q8_0`) fit on GB300
|
||
- **Disk space** for model downloads: plan for ~19 GB for `glm-4.7-flash:latest`, plus additional space for the Q8_0 or bf16 variants if you use them
|
||
|
||
## Time & risk
|
||
|
||
* **Duration**: ~20–30 minutes (includes model download)
|
||
* **Risk level**: Low
|
||
* Large model downloads can fail if network connectivity is unstable
|
||
* Older Ollama versions will not load newer models
|
||
* **Rollback**: Stop Ollama and delete the downloaded model from `~/.ollama/models`
|
||
* **Last Updated:** 03/06/2026
|
||
* Model set to glm-4.7-flash; Ollama 0.15.0+; cleanup order and docs refresh
|
||
|
||
## Claude Code
|
||
|
||
## Step 1. Confirm your environment
|
||
|
||
**Description**: Verify the GPU is visible before installing anything.
|
||
|
||
```bash
|
||
nvidia-smi
|
||
```
|
||
|
||
**Expected output** (example): A table showing driver version and GPU(s). On DGX Station, the GPU name may appear as **NVIDIA GB300** (without "Ultra"):
|
||
|
||
```text
|
||
+-----------------------------------------------------------------------------+
|
||
| NVIDIA-SMI 5xx.xx Driver Version: 5xx.xx CUDA Version: 12.x |
|
||
|-------------------------------+----------------------+----------------------+
|
||
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
|
||
| 0 NVIDIA GB300 On | 00000000:06:00.0 Off | 0 |
|
||
...
|
||
```
|
||
|
||
## Step 2. Install or update Ollama
|
||
|
||
**Description**: Install Ollama or ensure it is recent enough for modern coding models.
|
||
|
||
```bash
|
||
curl -fsSL https://ollama.com/install.sh | sh
|
||
ollama --version
|
||
```
|
||
|
||
To install a specific version (e.g. 0.15.0 or newer, required for GLM-4.7-Flash):
|
||
|
||
```bash
|
||
curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.15.0 sh
|
||
```
|
||
|
||
If Ollama is already present and the version is 0.15.0 or newer, simply run:
|
||
|
||
```bash
|
||
ollama --version
|
||
```
|
||
|
||
**Expected output** (example):
|
||
|
||
```text
|
||
ollama version is 0.15.0
|
||
```
|
||
|
||
## Step 3. Pull a coding model
|
||
|
||
**Description**: Download the model weights to your DGX Station. This playbook supports two model options on Ollama; choose one (or both) depending on whether you want **fast loading and testing** or **best quality**.
|
||
|
||
**For fast loading and testing** — **glm-4.7-flash** (~19 GB for `latest`; loads quickly; ensure Ollama 0.15.0+):
|
||
|
||
```bash
|
||
ollama pull glm-4.7-flash
|
||
```
|
||
|
||
**For best quality** — **unsloth/GLM-4.7-GGUF:Q8_0** from Hugging Face (larger, higher quality; supported on Ollama):
|
||
|
||
```bash
|
||
ollama pull hf.co/unsloth/GLM-4.7-GGUF:Q8_0
|
||
```
|
||
|
||
**Other glm-4.7-flash variants** on GB300 (more GPU memory; bf16 is ~60 GB):
|
||
|
||
```bash
|
||
ollama pull glm-4.7-flash:q8_0
|
||
ollama pull glm-4.7-flash:bf16
|
||
```
|
||
|
||
**Expected output** (example): Progress lines followed by "success" and the model in `ollama list`:
|
||
|
||
```bash
|
||
ollama list
|
||
```
|
||
|
||
```text
|
||
NAME ID SIZE MODIFIED
|
||
glm-4.7-flash:latest abc123... 19 GB 1 minute ago
|
||
unsloth/GLM-4.7-GGUF:Q8_0 def456... ... ...
|
||
```
|
||
|
||
## Step 4. Test local inference
|
||
|
||
**Description**: Run a quick prompt to confirm the model loads. Use the same model name you pulled (e.g. `glm-4.7-flash` for fast testing, or `hf.co/unsloth/GLM-4.7-GGUF:Q8_0` for best quality).
|
||
|
||
```bash
|
||
ollama run glm-4.7-flash
|
||
```
|
||
|
||
Or, if you pulled the larger model:
|
||
|
||
```bash
|
||
ollama run hf.co/unsloth/GLM-4.7-GGUF:Q8_0
|
||
```
|
||
|
||
Try a prompt like:
|
||
|
||
```text
|
||
Write a short README checklist for a Python project.
|
||
```
|
||
|
||
**Expected output**: GLM-4.7-Flash may show **Thinking...** and reasoning text before the final answer, then the model's response. This is normal; wait for the reply to complete.
|
||
|
||
**Exit the Ollama REPL** when done: type `/bye` or press **Ctrl+D**.
|
||
|
||
## Step 5. Install Claude Code
|
||
|
||
**Description**: Install the CLI tool that will drive the local model.
|
||
|
||
```bash
|
||
curl -fsSL https://claude.ai/install.sh | sh
|
||
```
|
||
|
||
**Verify the installation**:
|
||
|
||
```bash
|
||
claude --version
|
||
```
|
||
|
||
**Expected output** (example): A version string such as `claude 0.x.x` or similar. If you see `claude: command not found`, ensure the install script added the CLI to your PATH (e.g. restart the terminal or source your shell profile); see [Troubleshooting](troubleshooting.md).
|
||
|
||
## Step 6. Increase context length (optional)
|
||
|
||
**Description**: Ollama defaults to a 4096 token context length. For coding agents and
|
||
larger codebases, set it to 64K tokens. This increases memory usage.
|
||
For more details on configuring context length and other parameters, see the Ollama documentation (context window and runtime options).
|
||
|
||
Set the context length per session in the Ollama REPL (use the same model name you pulled, e.g. `glm-4.7-flash` or `hf.co/unsloth/GLM-4.7-GGUF:Q8_0`):
|
||
|
||
```bash
|
||
ollama run glm-4.7-flash
|
||
```
|
||
|
||
Then, in the Ollama prompt:
|
||
|
||
```text
|
||
/set parameter num_ctx 64000
|
||
|
||
```
|
||
|
||
**Exit when done**: type `/bye` or press **Ctrl+D**.
|
||
|
||
Optional method (set globally when serving Ollama):
|
||
|
||
```bash
|
||
sudo systemctl stop ollama
|
||
OLLAMA_CONTEXT_LENGTH=64000 ollama serve
|
||
```
|
||
|
||
Keep this terminal open and run the next step in a new terminal.
|
||
|
||
## Step 7. Connect Claude Code to Ollama
|
||
|
||
**Description**: Point Claude Code to the local Ollama server and launch it. Use the model you pulled: `glm-4.7-flash` (fast) or `hf.co/unsloth/GLM-4.7-GGUF:Q8_0` (best quality).
|
||
|
||
```bash
|
||
export ANTHROPIC_AUTH_TOKEN=ollama
|
||
export ANTHROPIC_BASE_URL=http://localhost:11434
|
||
|
||
claude --model glm-4.7-flash
|
||
```
|
||
|
||
If you are using the larger model:
|
||
|
||
```bash
|
||
claude --model hf.co/unsloth/GLM-4.7-GGUF:Q8_0
|
||
```
|
||
|
||
- **`ANTHROPIC_AUTH_TOKEN=ollama`**: Claude Code treats the literal value `ollama` as a special token that means "use the local Ollama backend" instead of Anthropic's cloud API. No real API key is needed when using Ollama.
|
||
- **`ANTHROPIC_BASE_URL`**: Tells Claude Code to send requests to your local Ollama server at port 11434.
|
||
|
||
**Persist these variables** (optional) so you don't have to re-export every terminal session. Add to `~/.bashrc` or your shell profile (e.g. `~/.zshrc`):
|
||
|
||
```bash
|
||
echo 'export ANTHROPIC_AUTH_TOKEN=ollama' >> ~/.bashrc
|
||
echo 'export ANTHROPIC_BASE_URL=http://localhost:11434' >> ~/.bashrc
|
||
source ~/.bashrc
|
||
```
|
||
|
||
**Expected output**: Claude Code starts and uses the local model.
|
||
|
||
**Exit Claude Code** when done: type `/exit` or press **Ctrl+C**.
|
||
|
||
## Step 8. Complete a small coding task
|
||
|
||
**Description**: Create a tiny repo and let Claude Code implement a function and tests.
|
||
|
||
```bash
|
||
mkdir -p ~/cli-agent-demo
|
||
cd ~/cli-agent-demo
|
||
|
||
printf 'def add(a, b):\n """Return the sum of a and b."""\n pass\n' > math_utils.py
|
||
printf 'import math_utils\n\n\ndef test_add():\n assert math_utils.add(1, 2) == 3\n' > test_math_utils.py
|
||
```
|
||
|
||
If you do not already have pytest installed:
|
||
|
||
```bash
|
||
python -m pip install -U pytest
|
||
```
|
||
|
||
In Claude Code, enter:
|
||
|
||
```text
|
||
Please implement add() in math_utils.py and make sure the test passes.
|
||
```
|
||
|
||
**Exit Claude Code** when finished: type `/exit` or press **Ctrl+C**, then run the test:
|
||
|
||
```bash
|
||
python -m pytest -q
|
||
```
|
||
|
||
Expected output should show the test passing.
|
||
|
||
## Step 9. Cleanup and rollback
|
||
|
||
**Description**: Remove the model and stop the Ollama service if you no longer need them. **Remove the model first** (while the Ollama server is running), then stop the service.
|
||
|
||
> [!WARNING]
|
||
> The following removes the downloaded model files from disk.
|
||
|
||
**1. Remove the model** (Ollama must be running). Use the same name you pulled:
|
||
|
||
```bash
|
||
ollama rm glm-4.7-flash
|
||
```
|
||
|
||
Or, for the Hugging Face model:
|
||
|
||
```bash
|
||
ollama rm hf.co/unsloth/GLM-4.7-GGUF:Q8_0
|
||
```
|
||
|
||
Use the exact tag you pulled (e.g. `glm-4.7-flash:bf16` if you used that variant).
|
||
|
||
**2. Stop the Ollama service**:
|
||
|
||
```bash
|
||
sudo systemctl stop ollama
|
||
```
|
||
|
||
## Step 10. Next steps
|
||
|
||
- **Fast loading and testing:** use **glm-4.7-flash** for quick iteration and smaller downloads.
|
||
- **Best quality:** use **unsloth/GLM-4.7-GGUF:Q8_0** (Hugging Face on Ollama) or **glm-4.7-flash** high-quality variants (`glm-4.7-flash:bf16`, `glm-4.7-flash:q8_0`) on DGX Station (NVIDIA GB300).
|
||
- Use larger context (e.g. 64K–198K) for big codebases.
|
||
- Use Claude Code on multi-file refactors or test-generation tasks.
|
||
|
||
## Troubleshooting
|
||
|
||
| Symptom | Cause | Fix |
|
||
|---------|-------|-----|
|
||
| `ollama: command not found` | Ollama not installed or PATH not updated | Rerun `curl -fsSL https://ollama.com/install.sh | sh` and open a new shell |
|
||
| Model load fails with version error | Ollama is older than 0.15.0 | Update Ollama to 0.15.0 or newer (required for GLM-4.7-Flash). Do not pin to 0.14.3. |
|
||
| `model not found` in Claude Code | Model was not pulled | Run `ollama pull glm-4.7-flash` or `ollama pull hf.co/unsloth/GLM-4.7-GGUF:Q8_0` and retry. Use the same model name in `claude --model ...`. |
|
||
| `connection refused` to localhost:11434 | Ollama service not running | Start with `ollama serve` or `sudo systemctl start ollama` |
|
||
| Slow responses or OOM | Insufficient GPU memory or fragmentation | On DGX Station (NVIDIA GB300), ensure no other heavy GPU workloads. If OOM persists, use a smaller variant (e.g. `glm-4.7-flash:q8_0` or `glm-4.7-flash:q4_K_M`) or `OLLAMA_MAX_LOADED_MODELS=1`. |
|
||
| `claude: command not found` after install | CLI not on PATH or install script did not complete | Restart the terminal or run `source ~/.bashrc` (or your shell profile). Check the install script output for the install path and add it to PATH. |
|
||
| Claude Code install fails (Node.js / network) | Node.js missing or install script cannot download | Ensure Node.js is installed (`node --version`). If the install script fails with a network error, retry from a stable connection or download the Claude Code CLI from the official site. See [Claude Code documentation](https://claude.ai/docs) for alternatives. |
|
||
|
||
> [!NOTE]
|
||
> DGX Station with **NVIDIA GB300** provides ample GPU memory for **glm-4.7-flash** (fast testing) and **unsloth/GLM-4.7-GGUF:Q8_0** (best quality), plus variants (e.g. `glm-4.7-flash:bf16`). Use `OLLAMA_MAX_LOADED_MODELS=1` if you hit memory limits with multiple models.
|