mirror of https://github.com/NVIDIA/dgx-spark-playbooks.git synced 2026-06-18 04:22:21 +00:00

Local Coding Agent

Run local CLI coding agents with Ollama on DGX Station (NVIDIA GB300) using glm-4.7-flash (fast) or unsloth/GLM-4.7-GGUF:Q8_0 (best quality)

Overview
Claude Code
Troubleshooting

Overview

Basic idea

Use Ollama on DGX Station (NVIDIA GB300) to run local coding models and connect a CLI coding agent. This playbook uses Claude Code to talk to Ollama for local inference, so you can work without external cloud APIs.

The DGX Station GPU (reported as NVIDIA GB300 in nvidia-smi) provides ample memory to run glm-4.7-flash (fast loading and testing) and larger models such as unsloth/GLM-4.7-GGUF:Q8_0 (best quality), both supported on Ollama.

CLI agent

This playbook uses Claude Code as the CLI agent, connected to a local Ollama model for inference.

What you'll accomplish

You will run a local coding model on your DGX Station (NVIDIA GB300) with Ollama, connect Claude Code to it, and complete a small coding task end-to-end. Use glm-4.7-flash (including high-quality variants) or unsloth/GLM-4.7-GGUF:Q8_0 for best quality.

What to know before starting

Comfort with Linux command line basics
Experience running terminal-based tools and editors
Familiarity with Python for the short coding task

Prerequisites

DGX Station with NVIDIA GB300 (Grace Blackwell) and NVIDIA driver; nvidia-smi typically shows "NVIDIA GB300"
Internet access to download model weights
Ollama 0.15.0 or newer (required for GLM-4.7-Flash; do not pin to 0.14.3)
GPU memory on GB300 supports both recommended models:
- glm-4.7-flash: ~19 GB (latest) to ~60 GB (bf16) — recommended for fast loading and testing
- unsloth/GLM-4.7-GGUF:Q8_0 (Hugging Face on Ollama): larger model — recommended for best quality
- Other variants (e.g. glm-4.7-flash:bf16, glm-4.7-flash:q8_0) fit on GB300
Disk space for model downloads: plan for ~19 GB for glm-4.7-flash:latest, plus additional space for the Q8_0 or bf16 variants if you use them
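Before pulling, it can help to confirm that the disk has room. The sketch below is one way to do that with Python's standard library. It assumes the model files land on the filesystem that holds your home directory, which may not match how Ollama was installed on your system. The 5 GB safety margin and the `has_room` helper name are illustrative choices, not part of the playbook.

```python
import shutil
from pathlib import Path

GB = 10**9  # decimal gigabytes, matching the sizes quoted above


def has_room(free_bytes: int, required_gb: float, margin_gb: float = 5.0) -> bool:
    """Return True if free_bytes covers the download plus a safety margin."""
    return free_bytes >= (required_gb + margin_gb) * GB


if __name__ == "__main__":
    # Assumption: models are stored on the filesystem that holds your home directory.
    free = shutil.disk_usage(Path.home()).free
    verdict = "enough" if has_room(free, required_gb=19) else "NOT enough"
    print(f"{free / GB:.0f} GB free: {verdict} for glm-4.7-flash:latest (~19 GB)")
```

Raise `required_gb` if you also plan to pull the Q8_0 or bf16 variants.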

Time & risk

Duration: ~20–30 minutes (includes model download)
Risk level: Low
- Large model downloads can fail if network connectivity is unstable
- Older Ollama versions will not load newer models
Rollback: Remove the downloaded model with ollama rm, then stop the Ollama service (see Step 9)
Last Updated: 03/06/2026
- Model set to glm-4.7-flash; Ollama 0.15.0+; cleanup order and docs refresh

Claude Code

Step 1. Confirm your environment

Description: Verify the GPU is visible before installing anything.

nvidia-smi

Expected output (example): A table showing driver version and GPU(s). On DGX Station, the GPU name may appear as NVIDIA GB300 (without "Ultra"):

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 5xx.xx    Driver Version: 5xx.xx    CUDA Version: 12.x          |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
|   0  NVIDIA GB300        On   | 00000000:06:00.0 Off |                    0 |
...

Step 2. Install or update Ollama

Description: Install Ollama or ensure it is recent enough for modern coding models.

curl -fsSL https://ollama.com/install.sh | sh
ollama --version

To install a specific version (e.g. 0.15.0 or newer, required for GLM-4.7-Flash):

curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.15.0 sh

If Ollama is already installed, skip the install and confirm that the reported version is 0.15.0 or newer:

ollama --version

Expected output (example):

ollama version is 0.15.0
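If you want to script the version check rather than read it by eye, the following is a minimal sketch. It assumes the output contains a plain x.y.z version number, as in the example above. The helper names `parse_version` and `meets_minimum` are hypothetical, and any pre-release suffix is ignored.

```python
import re
import shutil
import subprocess

MINIMUM = (0, 15, 0)  # minimum Ollama version stated in this playbook


def parse_version(text: str) -> tuple[int, int, int]:
    """Extract the first x.y.z version number found in text."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in: {text!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch)


def meets_minimum(text: str, minimum: tuple[int, int, int] = MINIMUM) -> bool:
    """Compare versions as integer tuples, not as strings."""
    return parse_version(text) >= minimum


if __name__ == "__main__":
    if shutil.which("ollama") is None:
        print("ollama is not on PATH")
    else:
        result = subprocess.run(["ollama", "--version"], capture_output=True, text=True)
        output = result.stdout + result.stderr
        print("OK" if meets_minimum(output) else "Update Ollama to 0.15.0 or newer")
```

Comparing integer tuples avoids the string-comparison trap where "0.9.0" would sort after "0.15.0".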

Step 3. Pull a coding model

Description: Download the model weights to your DGX Station. This playbook supports two model options on Ollama; choose one (or both) depending on whether you want fast loading and testing or best quality.

For fast loading and testing — glm-4.7-flash (~19 GB for latest; loads quickly; ensure Ollama 0.15.0+):

ollama pull glm-4.7-flash

For best quality — unsloth/GLM-4.7-GGUF:Q8_0 from Hugging Face (larger, higher quality; supported on Ollama):

ollama pull hf.co/unsloth/GLM-4.7-GGUF:Q8_0

Other glm-4.7-flash variants that fit on GB300 (these use more GPU memory; bf16 is ~60 GB):

ollama pull glm-4.7-flash:q8_0
ollama pull glm-4.7-flash:bf16

Expected output (example): Progress lines followed by "success". The model then appears in ollama list:

ollama list

NAME                                ID              SIZE    MODIFIED
glm-4.7-flash:latest                abc123...       19 GB   1 minute ago
hf.co/unsloth/GLM-4.7-GGUF:Q8_0     def456...       ...    ...
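The same check can be scripted against the Ollama REST API, which lists local models at /api/tags. The sketch below assumes the server is running at the default address used in this playbook and that model names match exactly, including case. `normalize` and `model_is_pulled` are hypothetical helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address used in this playbook


def normalize(name: str) -> str:
    """Append ':latest' when the last path segment carries no tag."""
    last_segment = name.rsplit("/", 1)[-1]
    return name if ":" in last_segment else f"{name}:latest"


def model_is_pulled(tags_payload: dict, name: str) -> bool:
    """Check a /api/tags response for the given model name."""
    wanted = normalize(name)
    return any(m.get("name") == wanted for m in tags_payload.get("models", []))


def fetch_tags(base_url: str = OLLAMA_URL) -> dict:
    """Fetch the list of local models from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)
```

With the server running, `model_is_pulled(fetch_tags(), "glm-4.7-flash")` should return True once the pull has finished.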

Step 4. Test local inference

Description: Run a quick prompt to confirm the model loads. Use the same model name you pulled (e.g. glm-4.7-flash for fast testing, or hf.co/unsloth/GLM-4.7-GGUF:Q8_0 for best quality).

ollama run glm-4.7-flash

Or, if you pulled the larger model:

ollama run hf.co/unsloth/GLM-4.7-GGUF:Q8_0

Try a prompt like:

Write a short README checklist for a Python project.

Expected output: GLM-4.7-Flash may show Thinking... and reasoning text before the final answer, then the model's response. This is normal; wait for the reply to complete.

Exit the Ollama REPL when done: type /bye or press Ctrl+D.

Step 5. Install Claude Code

Description: Install the CLI tool that will drive the local model.

curl -fsSL https://claude.ai/install.sh | bash

Verify the installation:

claude --version

Expected output (example): A version string such as claude 0.x.x or similar. If you see claude: command not found, ensure the install script added the CLI to your PATH (e.g. restart the terminal or source your shell profile); see Troubleshooting.

Step 6. Increase context length (optional)

Description: Ollama defaults to a 4096-token context length. For coding agents and larger codebases, raise it to about 64K tokens (the commands below use 64000). A larger context increases memory usage. For more details on configuring context length and other parameters, see the Ollama documentation (context window and runtime options).

Set the context length per session in the Ollama REPL (use the same model name you pulled, e.g. glm-4.7-flash or hf.co/unsloth/GLM-4.7-GGUF:Q8_0):

ollama run glm-4.7-flash

Then, in the Ollama prompt:

/set parameter num_ctx 64000

Exit when done: type /bye or press Ctrl+D.
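For scripted checks outside the REPL, the Ollama REST API accepts the same parameter per request inside an options object. The sketch below builds such a request for /api/generate. The helper names are hypothetical, and the setting applies only to requests your own script sends, not to other clients.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address used in this playbook


def build_generate_request(model: str, prompt: str, num_ctx: int = 64000) -> dict:
    """Build a non-streaming /api/generate payload with a custom context length."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }


def generate(payload: dict, base_url: str = OLLAMA_URL) -> str:
    """Send the payload to a running Ollama server and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=600) as resp:
        return json.load(resp)["response"]
```

For example, `generate(build_generate_request("glm-4.7-flash", "Say hello."))` sends one prompt with the larger context window.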

Optional method (set globally when serving Ollama):

sudo systemctl stop ollama
OLLAMA_CONTEXT_LENGTH=64000 ollama serve

Keep this terminal open and run the next step in a new terminal.

Step 7. Connect Claude Code to Ollama

Description: Point Claude Code to the local Ollama server and launch it. Use the model you pulled: glm-4.7-flash (fast) or hf.co/unsloth/GLM-4.7-GGUF:Q8_0 (best quality).

export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_BASE_URL=http://localhost:11434

claude --model glm-4.7-flash

If you are using the larger model:

claude --model hf.co/unsloth/GLM-4.7-GGUF:Q8_0

ANTHROPIC_AUTH_TOKEN=ollama: A placeholder value. Claude Code expects a token to be set, but the local Ollama server does not validate it, so no real API key is needed. Requests go to Ollama because of ANTHROPIC_BASE_URL, not because of the token value.
ANTHROPIC_BASE_URL: Tells Claude Code to send requests to your local Ollama server at port 11434.
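If you would rather not change your shell environment at all, the two variables can be scoped to a single process. The following is a minimal sketch that assumes claude is already on your PATH. `ollama_env` and `launch_claude` are hypothetical helper names.

```python
import os
import subprocess


def ollama_env(base_url: str = "http://localhost:11434") -> dict:
    """Return a copy of the current environment with the two variables set."""
    env = dict(os.environ)
    env["ANTHROPIC_AUTH_TOKEN"] = "ollama"
    env["ANTHROPIC_BASE_URL"] = base_url
    return env


def launch_claude(model: str = "glm-4.7-flash") -> int:
    """Start Claude Code against the local model and return its exit code."""
    return subprocess.run(["claude", "--model", model], env=ollama_env()).returncode
```

Calling `launch_claude()` starts an interactive session. Because the variables live only in the child process, other terminals and tools keep their existing settings.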

Persist these variables (optional) so you don't have to re-export every terminal session. Add to ~/.bashrc or your shell profile (e.g. ~/.zshrc):

echo 'export ANTHROPIC_AUTH_TOKEN=ollama' >> ~/.bashrc
echo 'export ANTHROPIC_BASE_URL=http://localhost:11434' >> ~/.bashrc
source ~/.bashrc

Expected output: Claude Code starts and uses the local model.

Exit Claude Code when done: type /exit or press Ctrl+C.

Step 8. Complete a small coding task

Description: Create a tiny repo and let Claude Code implement a function and tests.

mkdir -p ~/cli-agent-demo
cd ~/cli-agent-demo

printf 'def add(a, b):\n    """Return the sum of a and b."""\n    pass\n' > math_utils.py
printf 'import math_utils\n\n\ndef test_add():\n    assert math_utils.add(1, 2) == 3\n' > test_math_utils.py

If you do not already have pytest installed:

python -m pip install -U pytest

In Claude Code, enter:

Please implement add() in math_utils.py and make sure the test passes.

Exit Claude Code when finished: type /exit or press Ctrl+C, then run the test:

python -m pytest -q

Expected output should show the test passing.
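For reference, one possible end state is sketched below. The agent's actual edit may differ in style or add extra checks; this only shows the smallest change that makes the test pass. The test is inlined here for brevity, whereas in the demo repo it lives in test_math_utils.py and imports math_utils.

```python
# math_utils.py: one possible result after the agent fills in the stub
def add(a, b):
    """Return the sum of a and b."""
    return a + b


# Inlined copy of the check from test_math_utils.py
def test_add():
    assert add(1, 2) == 3
```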

Step 9. Cleanup and rollback

Description: Remove the model and stop the Ollama service if you no longer need them. Remove the model first (while the Ollama server is running), then stop the service.

Warning

The following removes the downloaded model files from disk.

1. Remove the model (Ollama must be running). Use the same name you pulled:

ollama rm glm-4.7-flash

Or, for the Hugging Face model:

ollama rm hf.co/unsloth/GLM-4.7-GGUF:Q8_0

Use the exact tag you pulled (e.g. glm-4.7-flash:bf16 if you used that variant).

2. Stop the Ollama service:

sudo systemctl stop ollama

Step 10. Next steps

Fast loading and testing: use glm-4.7-flash for quick iteration and smaller downloads.
Best quality: use unsloth/GLM-4.7-GGUF:Q8_0 (Hugging Face on Ollama) or glm-4.7-flash high-quality variants (glm-4.7-flash:bf16, glm-4.7-flash:q8_0) on DGX Station (NVIDIA GB300).
Use larger context (e.g. 64K–198K) for big codebases.
Use Claude Code on multi-file refactors or test-generation tasks.

Troubleshooting

Symptom	Cause	Fix
`ollama: command not found`	Ollama not installed or PATH not updated	Rerun the install command from Step 2: `curl -fsSL https://ollama.com/install.sh | sh`
Model load fails with version error	Ollama is older than 0.15.0	Update Ollama to 0.15.0 or newer (required for GLM-4.7-Flash). Do not pin to 0.14.3.
`model not found` in Claude Code	Model was not pulled	Run `ollama pull glm-4.7-flash` or `ollama pull hf.co/unsloth/GLM-4.7-GGUF:Q8_0` and retry. Use the same model name in `claude --model ...`.
`connection refused` to localhost:11434	Ollama service not running	Start with `ollama serve` or `sudo systemctl start ollama`
Slow responses or OOM	Insufficient GPU memory or fragmentation	On DGX Station (NVIDIA GB300), ensure no other heavy GPU workloads are running. If OOM persists, use a smaller variant (e.g. `glm-4.7-flash:q8_0` or `glm-4.7-flash:q4_K_M`) or set `OLLAMA_MAX_LOADED_MODELS=1`.
`claude: command not found` after install	CLI not on PATH or install script did not complete	Restart the terminal or run `source ~/.bashrc` (or your shell profile). Check the install script output for the install path and add it to PATH.
Claude Code install fails (Node.js / network)	Node.js missing or install script cannot download	Ensure Node.js is installed (`node --version`). If the install script fails with a network error, retry from a stable connection or download the Claude Code CLI from the official site. See Claude Code documentation for alternatives.
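To tell a stopped server apart from a misconfigured client when you see connection refused, a plain TCP check against the Ollama port is often enough. This is a minimal sketch using the default host and port from this playbook. `port_open` is a hypothetical helper name.

```python
import socket


def port_open(host: str = "localhost", port: int = 11434, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    state = "reachable" if port_open() else "not reachable"
    print(f"Ollama port 11434 is {state}")
```

If the port is not reachable, start the service as described in the table. If it is reachable, recheck ANTHROPIC_BASE_URL in the terminal where you launch Claude Code.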

Note

DGX Station with NVIDIA GB300 provides ample GPU memory for glm-4.7-flash (fast testing) and unsloth/GLM-4.7-GGUF:Q8_0 (best quality), plus variants (e.g. glm-4.7-flash:bf16). Use OLLAMA_MAX_LOADED_MODELS=1 if you hit memory limits with multiple models.
