CLI Coding Agent
Build local CLI coding agents with Ollama
Overview
Basic idea
Use Ollama on DGX Spark to run local coding models and connect a CLI coding agent. This playbook supports three options: Claude Code, OpenCode, and Codex CLI. Each agent talks to Ollama for local inference, so you can work without external cloud APIs.
Choose your CLI agent
Pick the tab that matches the CLI agent you want to use:
- Claude Code: Fastest path to a working CLI agent with a local Ollama model.
- OpenCode: Open-source CLI with provider configuration; this guide targets Ollama.
- Codex CLI: OpenAI Codex CLI configured to run against Ollama locally.
What you'll accomplish
You will run a local coding model on your DGX Spark with Ollama, connect it to your chosen CLI agent, and complete a small coding task end-to-end.
What to know before starting
- Comfort with Linux command line basics
- Experience running terminal-based tools and editors
- Familiarity with Python for the short coding task
Prerequisites
- DGX Spark access with NVIDIA DGX OS 7.3.1 (Ubuntu 24.04.3 LTS base)
- Internet access to download model weights
- Ollama 0.14.3 or newer
- GPU memory depends on the model you choose. Example requirements for GLM-4.7-Flash:
  - 19GB+ for glm-4.7-flash:latest
  - 32GB+ for glm-4.7-flash:q8_0
  - 60GB+ for glm-4.7-flash:bf16
Time & risk
- Duration: ~20-30 minutes (includes model download time)
- Risk level: Low
- Large model downloads can fail if network connectivity is unstable
- Older Ollama versions will not load the model
- Rollback: Stop Ollama and delete the downloaded model from ~/.ollama/models
- Last Updated: 01/21/2026 (first publication)
Claude Code
Step 1. Confirm your environment
Description: Verify the OS version and GPU are visible before installing anything.
cat /etc/os-release | head -n 2
nvidia-smi
Expected output should show Ubuntu 24.04.3 LTS (DGX OS 7.3.1 base) and a detected GPU.
Step 2. Install or update Ollama
Description: Install Ollama or ensure it is recent enough for modern coding models.
curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.14.3 sh
ollama --version
If Ollama is already installed and the version is 0.14.3 or newer, simply run:
ollama --version
Expected output should show version 0.14.3 or newer.
Step 3. Pull GLM-4.7-Flash
Description: Download the model weights to your Spark node.
ollama pull glm-4.7-flash
Optional variants if you need different memory footprints:
ollama pull glm-4.7-flash:q4_K_M
ollama pull glm-4.7-flash:q8_0
ollama pull glm-4.7-flash:bf16
Expected output should show glm-4.7-flash (and any optional variants you pulled) in ollama list.
Step 4. Test local inference
Description: Run a quick prompt to confirm the model loads.
ollama run glm-4.7-flash
Try a prompt like:
Write a short README checklist for a Python project.
Expected output should show the model responding in the terminal.
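For a non-interactive check, you can also query Ollama's HTTP API directly. This assumes the default port 11434 and the glm-4.7-flash tag pulled above:
# Send a single prompt to the local Ollama server and print the JSON reply
curl -s http://localhost:11434/api/generate \
  -H 'Content-Type: application/json' \
  -d '{"model": "glm-4.7-flash", "prompt": "Say hello in one sentence.", "stream": false}'
A JSON reply containing a response field confirms the model loads and generates text.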
Step 5. Install Claude Code
Description: Install the CLI tool that will drive the local model.
curl -fsSL https://claude.ai/install.sh | sh
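To confirm the install succeeded, open a new shell (so PATH updates take effect) and check the version:
claude --version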
Step 6. Increase context length (optional)
Description: Ollama defaults to a 4096 token context length. For coding agents and larger codebases, set it to 64K tokens. This increases memory usage. For more details on configuring context length, see the Ollama documentation.
Set the context length per session in the Ollama REPL:
ollama run glm-4.7-flash
Then, in the Ollama prompt:
/set parameter num_ctx 64000
Optional method (set globally when serving Ollama):
sudo systemctl stop ollama
OLLAMA_CONTEXT_LENGTH=64000 ollama serve
Keep this terminal open and run the next step in a new terminal.
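If you would rather not set num_ctx every session, one option is to bake the larger context into a derived model with a Modelfile. The glm-4.7-flash-64k name below is just an example:
# Derive a model variant with a 64K context window baked in
cat > Modelfile <<'EOF'
FROM glm-4.7-flash
PARAMETER num_ctx 64000
EOF
ollama create glm-4.7-flash-64k -f Modelfile
If you use this approach, point your CLI agent at glm-4.7-flash-64k instead of glm-4.7-flash.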
Step 7. Connect Claude Code to Ollama
Description: Point Claude Code to the local Ollama server and launch it.
export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_BASE_URL=http://localhost:11434
claude --model glm-4.7-flash
Expected output should show Claude Code starting and using the local model.
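The export lines above only apply to the current shell. To keep them across sessions, one approach is to append them to your ~/.bashrc:
# Persist the local Ollama endpoint settings for future shells
cat >> ~/.bashrc <<'EOF'
export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_BASE_URL=http://localhost:11434
EOF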
Step 8. Complete a small coding task
Description: Create a tiny repo and let Claude Code implement a function and tests.
mkdir -p ~/cli-agent-demo
cd ~/cli-agent-demo
printf 'def add(a, b):\n    """Return the sum of a and b."""\n    pass\n' > math_utils.py
printf 'import math_utils\n\n\ndef test_add():\n    assert math_utils.add(1, 2) == 3\n' > test_math_utils.py
If you do not already have pytest installed:
python -m pip install -U pytest
In Claude Code:
Please implement add() in math_utils.py and make sure the test passes.
Run the test:
python -m pytest -q
Expected output should show the test passing.
Step 9. Cleanup and rollback
Description: Remove the model and stop services if you no longer need them.
To stop the service:
sudo systemctl stop ollama
Warning
This will delete the downloaded model files.
ollama rm glm-4.7-flash
Step 10. Next steps
- Try larger code tasks with the 198K context window
- Experiment with glm-4.7-flash:q8_0 or glm-4.7-flash:bf16 for higher quality
- Use Claude Code on multi-file refactors or test-generation tasks
OpenCode
Step 1. Confirm your environment
Description: Verify the OS version and GPU are visible before installing anything.
cat /etc/os-release | head -n 2
nvidia-smi
Expected output should show Ubuntu 24.04.3 LTS (DGX OS 7.3.1 base) and a detected GPU.
Step 2. Install or update Ollama
Description: Install Ollama or ensure it is recent enough for modern coding models.
curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.14.3 sh
ollama --version
If Ollama is already installed and the version is 0.14.3 or newer, simply run:
ollama --version
Expected output should show version 0.14.3 or newer.
Step 3. Pull a coding model
Description: Download a local coding model to your Spark node.
ollama pull glm-4.7-flash
Optional variants if you need different memory footprints:
ollama pull glm-4.7-flash:q4_K_M
ollama pull glm-4.7-flash:q8_0
ollama pull glm-4.7-flash:bf16
Expected output should show your model in ollama list.
Step 4. Install OpenCode
Description: Install the OpenCode CLI using the official Linux instructions.
Follow the install guide at https://opencode.ai/docs, then verify:
opencode --version
Step 5. Configure OpenCode to use Ollama
Description: Point OpenCode to your local Ollama server with an opencode.json.
Create opencode.json in your project directory (or the location you prefer for OpenCode config):
{
"$schema": "https://opencode.ai/config.json",
"provider": {
"ollama": {
"npm": "@ai-sdk/openai-compatible",
"name": "Ollama (local)",
"options": {
"baseURL": "http://localhost:11434/v1"
},
"models": {
"glm-4.7-flash": {
"name": "glm-4.7-flash"
}
}
}
}
}
Replace glm-4.7-flash with the model you pulled. If Ollama is running on another host,
update the baseURL accordingly.
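Before launching OpenCode, you can sanity-check the OpenAI-compatible endpoint it will use. This assumes Ollama is running locally on the default port:
# List the models exposed through Ollama's OpenAI-compatible API
curl -s http://localhost:11434/v1/models
The JSON response should include the model you pulled; a connection error means Ollama is not running.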
Step 6. Increase context length (optional)
Description: Ollama defaults to a 4096 token context length. For coding agents and larger codebases, set it to 64K tokens. This increases memory usage. For more details, see the Ollama documentation.
Set the context length per session in the Ollama REPL:
ollama run glm-4.7-flash
Then, in the Ollama prompt:
/set parameter num_ctx 64000
Optional method (set globally when serving Ollama):
sudo systemctl stop ollama
OLLAMA_CONTEXT_LENGTH=64000 ollama serve
Keep this terminal open and run the next step in a new terminal.
Step 7. Launch OpenCode
Description: Start the OpenCode CLI and select the Ollama provider and model.
opencode
If prompted, select the Ollama provider and the model you configured.
Step 8. Complete a small coding task
Description: Create a tiny repo and let OpenCode implement a function and tests.
mkdir -p ~/cli-agent-demo
cd ~/cli-agent-demo
printf 'def add(a, b):\n    """Return the sum of a and b."""\n    pass\n' > math_utils.py
printf 'import math_utils\n\n\ndef test_add():\n    assert math_utils.add(1, 2) == 3\n' > test_math_utils.py
If you do not already have pytest installed:
python -m pip install -U pytest
In OpenCode:
Please implement add() in math_utils.py and make sure the test passes.
Run the test:
python -m pytest -q
Expected output should show the test passing.
Step 9. Cleanup and rollback
Description: Remove the model and stop services if you no longer need them.
To stop the service:
sudo systemctl stop ollama
Warning
This will delete the downloaded model files.
ollama rm glm-4.7-flash
Step 10. Next steps
- Try other coding models available in Ollama
- Experiment with higher context lengths for larger refactors
- Use OpenCode on multi-file changes or test-generation tasks
Codex CLI
Step 1. Confirm your environment
Description: Verify the OS version and GPU are visible before installing anything.
cat /etc/os-release | head -n 2
nvidia-smi
Expected output should show Ubuntu 24.04.3 LTS (DGX OS 7.3.1 base) and a detected GPU.
Step 2. Install or update Ollama
Description: Install Ollama or ensure it is recent enough for modern coding models.
curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.14.3 sh
ollama --version
If Ollama is already installed and the version is 0.14.3 or newer, simply run:
ollama --version
Expected output should show version 0.14.3 or newer.
Step 3. Install Codex CLI
Description: Install the Codex CLI.
npm install -g @openai/codex
codex --version
Step 4. Start Codex with Ollama
Description: Launch Codex with the OSS flag to use Ollama.
codex --oss
By default, Codex uses the local gpt-oss:20b model.
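If you prefer to download the default model ahead of time rather than waiting on the first launch, you can pull it before starting Codex:
ollama pull gpt-oss:20b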
Step 5. Optional settings
Description: Adjust the model or context length if needed.
To use GLM-4.7-Flash with Codex, pull the model and start Codex with -m:
ollama pull glm-4.7-flash
codex --oss -m glm-4.7-flash
To switch to other models, use the -m flag:
codex --oss -m gpt-oss:120b
To use a cloud model:
codex --oss -m gpt-oss:120b-cloud
Codex works best with a large context window. We recommend 64K tokens. For more details, see the Ollama documentation.
Set the context length per session in the Ollama REPL:
ollama run glm-4.7-flash
Then, in the Ollama prompt:
/set parameter num_ctx 64000
Optional method (set globally when serving Ollama):
sudo systemctl stop ollama
OLLAMA_CONTEXT_LENGTH=64000 ollama serve
Replace glm-4.7-flash with the model you are using (for example, gpt-oss:20b).
Keep this terminal open and run the next step in a new terminal.
Step 6. Advanced configuration (optional)
Description: Set defaults or point Codex at a remote Ollama server.
Create or edit ~/.codex/config.toml:
model = "glm-4.7-flash"
model_provider = "ollama"
[model_providers.ollama]
base_url = "http://localhost:11434/v1"
If Ollama is running on another host, update the base_url accordingly. You can set
model to any Ollama model you want Codex to use.
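To confirm the endpoint in config.toml is reachable, a quick request against Ollama's OpenAI-compatible chat API can help. The prompt here is only an example:
# Ask the local model for a one-word reply through the /v1 endpoint Codex will use
curl -s http://localhost:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "glm-4.7-flash", "messages": [{"role": "user", "content": "Reply with OK."}]}'
A JSON completion in the response confirms Codex will be able to reach the model.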
Step 7. Complete a small coding task
Description: Create a tiny repo and let Codex implement a function and tests.
mkdir -p ~/cli-agent-demo
cd ~/cli-agent-demo
printf 'def add(a, b):\n    """Return the sum of a and b."""\n    pass\n' > math_utils.py
printf 'import math_utils\n\n\ndef test_add():\n    assert math_utils.add(1, 2) == 3\n' > test_math_utils.py
If you do not already have pytest installed:
python -m pip install -U pytest
In Codex:
Please implement add() in math_utils.py and make sure the test passes.
Run the test:
python -m pytest -q
Expected output should show the test passing.
Step 8. Cleanup and rollback
Description: Remove the model and stop services if you no longer need them.
To stop the service:
sudo systemctl stop ollama
Warning
This will delete the downloaded model files.
ollama rm gpt-oss:20b
Replace gpt-oss:20b with the model you used.
Step 9. Next steps
- Try other Ollama coding models with Codex CLI
- Experiment with higher context lengths for larger refactors
- Use Codex CLI on multi-file changes or test-generation tasks
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| `ollama: command not found` | Ollama not installed or PATH not updated | Rerun the Ollama install script from Step 2 and open a new shell |
| Model load fails with version error | Ollama is older than 0.14.3 | Update Ollama to 0.14.3 or newer |
| `model not found` in Claude Code | Model was not pulled | Run `ollama pull glm-4.7-flash` and retry |
| `opencode: command not found` | OpenCode not installed or PATH not updated | Install OpenCode and open a new shell |
| OpenCode cannot reach Ollama | `baseURL` misconfigured or Ollama not running | Set `baseURL` to `http://localhost:11434/v1` and start Ollama |
| `codex: command not found` | Codex CLI not installed or PATH not updated | Install Codex CLI and open a new shell |
| Codex CLI uses the wrong model/provider | `~/.codex/config.toml` not pointing to Ollama | Set `model_provider = "ollama"` and `base_url = "http://localhost:11434/v1"` |
| `connection refused` to localhost:11434 | Ollama service not running | Start with `ollama serve` or `systemctl start ollama` |
| Slow responses or OOM errors | Model variant too large for GPU memory | Use `glm-4.7-flash:q4_K_M` or close other GPU workloads |
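If you are unsure which models are resident in memory, recent Ollama versions can list and unload them:
ollama ps                    # show loaded models and their memory footprint
ollama stop glm-4.7-flash    # unload a model to free memory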
Note
DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. If you see memory pressure, flush the buffer cache with:
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'